Is there a distributed implementation of MinHashLSH? If not, is it possible to shard the base dataset across several machines?
For example, if my dataset has 10 billion items and can't fit in memory, can I shard it across 10 machines (one MinHashLSH index per machine, each holding a different 1 billion items)? When a query comes in, it searches all 10 machines and gathers the results. Is the gathered result the same as searching ONE MinHashLSH index that contains all 10 billion items?
It is a good idea to partition the data across multiple machines and build a separate index on each. In the example you gave, querying all 10 machines is going to give better results than looking at only one machine, as long as you combine the results and rank them by estimated or exact Jaccard similarity.
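To make the shard-and-merge idea concrete, here is a minimal sketch in plain Python. It does not use the library's `MinHashLSH`; `ToyLSH`, `minhash`, `query_shards`, and the `NUM_PERM`/`BANDS` values are hypothetical stand-ins written for illustration. The point it shows: whether an item is returned depends only on whether it shares a band bucket with the query, not on what else is in the index, so the union of the shard results matches a single index built with the same hash functions and band settings. With the real library, the analogous assumption would be that every shard uses the same `num_perm`, threshold, and MinHash seed.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 32  # signature length (assumed value for this sketch)
BANDS = 8      # number of LSH bands; rows per band = NUM_PERM // BANDS


def _hash(seed, token):
    # Deterministic 64-bit hash, so every shard computes identical signatures.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens):
    """MinHash signature: for each seed, the minimum hash over the token set."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_PERM))


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM


class ToyLSH:
    """Hypothetical banded LSH index; a stand-in for one shard's index."""

    def __init__(self):
        self.rows = NUM_PERM // BANDS
        self.tables = [defaultdict(set) for _ in range(BANDS)]
        self.signatures = {}

    def _bands(self, sig):
        for i in range(BANDS):
            yield i, sig[i * self.rows:(i + 1) * self.rows]

    def insert(self, key, sig):
        self.signatures[key] = sig
        for i, band in self._bands(sig):
            self.tables[i][band].add(key)

    def query(self, sig):
        # A key is a candidate if it shares at least one band bucket with sig.
        found = set()
        for i, band in self._bands(sig):
            found |= self.tables[i].get(band, set())
        return found


def query_shards(shards, sig):
    """Query every shard, then rank the merged hits by estimated Jaccard."""
    hits = []
    for shard in shards:
        for key in shard.query(sig):
            hits.append((estimate_jaccard(sig, shard.signatures[key]), key))
    return sorted(hits, key=lambda pair: (-pair[0], pair[1]))


docs = {
    "a": "the quick brown fox jumps over the lazy dog".split(),
    "b": "the quick brown fox jumps over the lazy cat".split(),
    "c": "completely unrelated words appear here instead".split(),
    "d": "the quick brown fox jumps over a lazy dog".split(),
}

single = ToyLSH()                  # one index holding everything
shards = [ToyLSH(), ToyLSH()]      # the same data split across two shards
for n, key in enumerate(sorted(docs)):
    sig = minhash(docs[key])
    single.insert(key, sig)
    shards[n % 2].insert(key, sig)

query_sig = minhash(docs["a"])
merged = query_shards(shards, query_sig)
```

In this sketch each shard returns its local candidates and the caller does the final ranking, which is the "combine and rank" step described above. In a real deployment the per-shard queries would run in parallel over the network, and the ranking could use exact Jaccard if the original sets are available.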