Set up collaborative pinning clusters #68
Comments
@hsanjuan are you able to provide some thoughts on this? My hope is to automate things so the collaborative cluster gets updated every time |
@lidel it's pretty simple. If you have a machine with the data pinned in IPFS, you can add ipfs-cluster on top. You need ipfs-cluster-service and ipfs-cluster-ctl. The cluster needs to be initialized:
then the cluster service needs to run continuously as a daemon:
Note that this machine is by default the only machine able to change the data in the cluster (trusted peer). You can add other trusted peers later. You can now go ahead and pin a CID to the cluster:
There are command options available that define how many times the data should be pinned (replicated) in the cluster, etc. I think in the end we would shard the data in the cluster, meaning that the content is split into multiple pins and each of these pins gets a minimum and maximum number of replications. |
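The setup steps described in this comment can be sketched roughly as follows. This is a dry-run helper that only builds and prints the command lines, it does not execute anything. The helper name `cluster_setup_commands`, the placeholder CID `QmExampleCid` and the replication values are assumptions for illustration; the `--replication-min` and `--replication-max` options are the ones ipfs-cluster-ctl offers for `pin add`, but check `ipfs-cluster-ctl pin add --help` for your version.

```python
import shlex


def cluster_setup_commands(cid, replication_min=2, replication_max=3):
    """Return the CLI invocations as argument lists (nothing is executed)."""
    return [
        # One-time initialization of the cluster peer's configuration.
        ["ipfs-cluster-service", "init"],
        # Long-running daemon; in practice this runs under a service manager.
        ["ipfs-cluster-service", "daemon"],
        # Pin a CID with a minimum and maximum number of replicas.
        [
            "ipfs-cluster-ctl", "pin", "add",
            "--replication-min", str(replication_min),
            "--replication-max", str(replication_max),
            cid,
        ],
    ]


if __name__ == "__main__":
    # "QmExampleCid" is a hypothetical placeholder, not a real CID.
    for command in cluster_setup_commands("QmExampleCid"):
        print(shlex.join(command))
```

Keeping the commands as argument lists makes it easy to hand them to `subprocess.run` later, for example from a CI job.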
When we add the current data, we should change the chunker to rabin, which allows for diff updates. We can also set the --expire-in flag to unpin old versions of pages and files automatically, for example after a year. If someone would like to archive the older versions, then this can be done by pinning them in IPFS. In the future we could modify the pages a bit and offer some history, like adding a link to the previous version of the file: when we update to a new version, we just add a static link that leads to the version before. After the cluster has automatically unpinned the files, they remain available until the last IPFS node has unpinned them and cleared them out of its cache. The most efficient way to update pins is to add the new version with IPFS and then trigger a pin update in the cluster. We just have to consider that certain files or pages will not be updated for a long time:
|
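A rough sketch of the update flow described in the previous comment, again as a dry-run that only builds the command lines. The helper name `update_commands`, the paths and CIDs, and the `8760h` (one year) duration are assumptions; `--chunker=rabin` is an `ipfs add` option, and `pin update` and `--expire-in` are taken from the ipfs-cluster-ctl commands mentioned in this thread. Whether re-pinning the old version with an expiry is the best way to age it out is an assumption worth verifying against the ipfs-cluster documentation.

```python
import shlex


def update_commands(path, old_cid, new_cid, expire_in="8760h"):
    """Build (but do not run) the commands for publishing a new version."""
    return [
        # Add the new version with the rabin chunker for better deduplication.
        ["ipfs", "add", "--recursive", "--chunker=rabin", path],
        # Ask the cluster to derive the new pin from the existing one.
        ["ipfs-cluster-ctl", "pin", "update", old_cid, new_cid],
        # Keep the old version around, but let it expire automatically.
        ["ipfs-cluster-ctl", "pin", "add", "--expire-in", expire_in, old_cid],
    ]


if __name__ == "__main__":
    # All three arguments are hypothetical placeholders.
    for command in update_commands("./wiki/Example.html", "QmOldCid", "QmNewCid"):
        print(shlex.join(command))
```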
The cluster setup should probably wait a while, until the development of IPFS has advanced a bit: if we get a bunch of people together to be part of the project, it's best to already have all the features we want at that point, so that people don't have to upgrade and we don't have to restore the already published data. @lidel hope you share the opinion :) |
go-ipfs 0.5.0 is going to have a lot of changes in general, so waiting at least for that is probably sensible. |
@lordcirth can Badger 2.0 be compacted when running an ipfs-cluster? Since the cluster has to unpin older versions after a while for storage considerations. But everyone can pin the old versions and keep them alive in the network. (that's my current mindset - others might think differently about this!) |
You mean garbage collection? I expect it should behave pretty much like Badger 1.0 did from the perspective of the user, and have all the same features. |
Well, no - the compaction of Badger itself, below IPFS. For v1 there were some issues reported, for example this one: |
@lidel since these are BIG (how big btw?), I'll need to deploy a big machine. Then we can automate adding the pin from the CI on this repo. I'll let you know |
@hsanjuan we could add each page/picture as an individual pin with a minimum and maximum number of replications. This way not every cluster member needs to hold the full amount of data, and each new pin is stored on the cluster members that have the most free space available. The en version of 2017 is just shy of 1 TB (I don't know the exact number, but it's quite big). If you consider the growth rate and that we want to hold the 13 languages with the most articles (see #63), we need considerably more storage than a single machine can easily hold (servers with a large number of hard drive slots get expensive fast). |
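To make the "most free space first" idea concrete, here is a simplified model of such an allocation. It is not ipfs-cluster's actual allocator; the function name `allocate`, the peer names and the sizes are all hypothetical, and real peers would report their free space as a metric rather than have it tracked in a dictionary.

```python
def allocate(free_space, pin_size, replication_min, replication_max):
    """Pick up to replication_max peers for a pin, preferring the most free space.

    free_space maps a peer name to its free bytes and is updated in place.
    Raises ValueError if fewer than replication_min peers can hold the pin.
    """
    # Only peers that can actually fit the pin are candidates.
    candidates = sorted(
        (peer for peer, free in free_space.items() if free >= pin_size),
        key=lambda peer: (-free_space[peer], peer),  # most free space first
    )
    if len(candidates) < replication_min:
        raise ValueError("not enough peers with free space for this pin")
    chosen = candidates[:replication_max]
    for peer in chosen:
        free_space[peer] -= pin_size
    return chosen


if __name__ == "__main__":
    peers = {"peer-a": 100, "peer-b": 300, "peer-c": 200, "peer-d": 50}
    print(allocate(peers, pin_size=80, replication_min=2, replication_max=3))
    print(peers)
```

With many small pins, each allocation shifts the ordering slightly, so data spreads across members instead of every member holding everything.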
Unfortunately, on a collaborative cluster this poses a lot of problems: participants are not trusted to actually be holding the content, not stable etc... |
True. But I expect that people who are willing to share space and traffic for such a project will do this because they want the project to succeed. The cluster will always try to create more copies if fewer than the maximum number of replicas are available, so if some machines go in and out, there's no issue at all (as far as I understand this functionality). Especially if we list some minimum requirements, I think many people will join such a project, as long as we don't require something like 20 TB of storage. This will greatly increase the distributed character of this project and would fit exactly what Wikipedia wants to achieve: everyone can participate. Suggestions for minimum requirements: Additionally, we need to organize the data pinned in the cluster in files/folders anyway, to make it accessible via paths rather than CIDs. If you would like to replicate a full copy, you're still able to do this: just locally pin the root folder of the language you want to cache. |
My idea is to have a separate cluster per language to start with. Many languages are small enough for anyone to keep a copy. |
This might result in poor availability of certain languages from less developed countries. I would prefer a single setup: this way we reduce the setup and maintenance time, and we can add additional languages once we've finished testing the system, without major changes on the follower systems. Additionally, the same media files are often used in different language versions. With just a single cluster we can avoid storing them once per language. |
Ok, another question here is, if we pinned each object individually, how many objects would that be? |
One per page, one per category, one per image, I guess. It doesn't really matter, we're currently doing the same thing with the static mirrors. The difference is that each file will be pinned individually instead of by reference, so that we can run 'ipfs-cluster-ctl pin update' to push a new version of an article to the same nodes that stored it before, for efficient deduplication. The best case scenario would be to get the latest edits from the special page of Wikipedia, generate the new versions and push them to the cluster. Or, if we want fewer updates, just get the list and process it after 24h. This way we can merge article edits that were made within 24h to reduce the update rate. The new folder CID can be published to IPNS after we have completed our list of updates. Old versions can be marked with a pin timeout, to allow access to older versions for a while (depending on how much space we have). |
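The 24h batching idea can be sketched like this: collect the edit events, keep only those inside the window, and regenerate each affected article once. The function name `coalesce_edits` and the shape of the input (title plus timestamp) are assumptions for illustration; how the edit list is actually fetched from Wikipedia is left out.

```python
from datetime import datetime, timedelta


def coalesce_edits(edits, window_end, window=timedelta(hours=24)):
    """Merge edit events so each article appears once per window.

    edits is an iterable of (title, timestamp) pairs. Returns a dict that maps
    each title edited within the window to the timestamp of its latest edit.
    """
    window_start = window_end - window
    latest = {}
    for title, timestamp in edits:
        if not (window_start < timestamp <= window_end):
            continue  # outside this batch, handled by another run
        if title not in latest or timestamp > latest[title]:
            latest[title] = timestamp
    return latest


if __name__ == "__main__":
    end = datetime(2020, 1, 2)
    events = [
        ("Istanbul", datetime(2020, 1, 1, 1)),
        ("Istanbul", datetime(2020, 1, 1, 5)),
        ("Ankara", datetime(2020, 1, 1, 3)),
    ]
    # Two articles to regenerate, even though there were three edits.
    print(coalesce_edits(events, end))
```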
How well will ipfs-cluster handle millions of separate pins? Has this been tested? |
@lordcirth we're here to test this, I guess. I'm currently working on a different cluster; once that work is finished, I thought about looking into setting this one up. |
While cluster might break with millions of individual pins (particularly in CRDT mode), IPFS probably has its own share of troubles with that. Therefore I'm a bit reluctant... |
@hsanjuan well, how could we otherwise get this to work? 🤔 I mean, IPFS can hold the current mirrored versions, which are also quite a lot of files. I just thought about using the inline feature of IPFS and cranking the limit up to avoid referencing for small files. |
@RubenKelevra by having a single root hash pinned |
So we need to modify an MFS tree when stuff is updated and then pin the new root recursively? Wouldn't this lead to a single pin of over a terabyte, which needs to be held completely by every cluster member? 🤔 |
Well yes. But it would work. The other ways will likely not work. But nevertheless, having a look at the numbers and at the most compact way to add Wikipedia would be awesome. |
How about running a CRC32 over the article names/file names and using the first 4 hex digits to make subfolders? This way we chunk the data into ~65k pieces. If we're talking about 10 TB, each of those pins would end up using about 160 MB. Much better to spread around. We probably still want to limit the concurrent pins to 1 or 2, to avoid extremely long pin downloads to cluster members. |
Yeah, that sounds better. Still, a cluster with a positive replication factor ("only replicate in X places") is easily abused by anyone reporting a lot of free space, for example: they will get allocated the content, only to perhaps not pin it at all. Unlike Filecoin, cluster/IPFS cannot ensure the content is actually present in the places that claim to have it. If you however have a level of trust in the participants (and they are stable enough), then that problem disappears. |
I got an idea how to solve this without THAT much effort. Should I open a ticket on the ipfs-cluster repo? Edit: I've written the idea down in a ticket. Hope you like it @hsanjuan. My ideas tend to be somewhat lengthy... it's hard for me to be very precise AND short in English, since it's not my native language. Hope you don't mind. |
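A minimal sketch of the CRC32 bucketing proposed above, assuming "the first 4 numbers" means the first four hexadecimal digits of the checksum (16 bits, so 65,536 subfolders). The names `shard_for` and `average_shard_bytes` are hypothetical. The arithmetic also shows where the size estimate comes from: 10 TiB spread over 65,536 shards is 160 MiB per shard on average, or roughly 153 MB if 10 TB is read as 10^13 bytes.

```python
import zlib

SHARD_COUNT = 16 ** 4  # four hex digits -> 65,536 possible subfolders


def shard_for(name: str) -> str:
    """Return the subfolder (first 4 hex digits of the CRC32) for a file name."""
    checksum = zlib.crc32(name.encode("utf-8")) & 0xFFFFFFFF
    return f"{checksum:08x}"[:4]


def average_shard_bytes(total_bytes: int) -> float:
    """Average shard size, assuming names hash evenly across all shards."""
    return total_bytes / SHARD_COUNT


if __name__ == "__main__":
    for title in ["Istanbul.html", "Ankara.html", "Logo.png"]:
        print(shard_for(title), title)
    # 10 TiB over 65,536 shards, expressed in MiB.
    print(average_shard_bytes(10 * 2**40) / 2**20)
```

Real shards will not be equally sized, since article and media sizes vary a lot, so the average is only a planning figure.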
I've looked into this topic again. IPFS can store git objects, so how about using this ability? The most compact form would be to store the raw text of the article plus the true title plus the URL. This way we could preserve not only the articles but also the complete history including the authors (which is in my opinion worth the additional work, to give proper credit to them). This would also allow us to update the mirror with a minimal amount of changes, since we're only adding some git objects to IPFS. For easier access I would recommend storing the latest version of each article as a text file in IPFS as well. This means the browser accessing the IPFS mirror would only need to understand a text resource instead of accessing a git repository. The question that comes to mind is: how do we render the wiki texts then? Well, we could take the abandoned JavaScript-based editor/parser from the MediaWiki team and fork it so that it runs directly in a browser. https://github.com/wikimedia/parsoid-jsapi This way the page would be rendered by the browser and the data would come from a text file in IPFS. This is similar to a what-you-see-is-what-you-get editor for Wikipedia, so the browser should be able to handle the conversion. Additionally, there are media files. I'm not sure if you're aware, but there's a new JPG standard on the way which might be very interesting for this project. It's called JPEG-XL. JPEG-XL can store JPG files natively and compress them further without any loss in quality (since it's transparent). Demo here. But much more interesting: we could transcode the JPG/PNG/GIF files from Wikipedia to JPEG-XL in the future. This would allow us to save just one big file, but fetch only the first small parts of it to get smaller resolutions for embedding in the articles. It supports RAW files natively with up to 32 bits per channel, alpha channels, CMYK and even crazier stuff.
It can also store animations efficiently and decode them at a lower resolution from a partial download. The format can store both lossy and lossless, depending on your needs, and creates much smaller files than JPG/WebM/GIF/PNG in all cases. Since JPEG-XL is backed by Google, I expect that it will land in Chromium/Chrome pretty soon. Using JPEG-XL would reduce our storage requirements in the cluster to a degree that storing all images used in the articles in native resolution (or at least a large resolution like 4k/8k) would come within reach. This way pictures can be clicked to enlarge them to the largest resolution we store, while we use the same file (but just the first part) to render the smaller size of the picture in the article. All right, opinions? :) Best regards Ruben Edit: forgot the link to parsoid-jsapi |
Alright folks, let's revisit this now that we have updated Turkish (#60) and English is WIP. @hsanjuan added Wikipedia to https://collab.ipfscluster.io and for now it only has the Turkish one, but when I manage to generate the latest English snapshot, the list will grow. I suggest we solidify some policies around this. Here is my proposal:
Does this sound sensible? Anything I've missed/misrepresented? |
It sounds sensible. We can re-adjust policies later if we decide so.
How much difference would there be between snapshots? It may be cheap to just leave old ones pinned for a while. As a policy, 1 month sounds good though. |
Sadly I have no metrics on dedup of unpacked data.
|
Closing this as we now have a Wikipedia section at https://collab.ipfscluster.io @hsanjuan we recently updated |
ipfs-cluster 0.12.0 shipped collaborative clusters, and there is a website with some demo datasets:
TODO
snapshot-hashes.yml