Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misc SimilarityIndex.add_to_index() issues #169

Open
allenleetc opened this issue Jan 9, 2024 · 4 comments
Open

Misc SimilarityIndex.add_to_index() issues #169

allenleetc opened this issue Jan 9, 2024 · 4 comments
Assignees
Labels
bug Bug fixes

Comments

@allenleetc
Copy link
Contributor

allenleetc commented Jan 9, 2024

Minor issues for similarity indexes when add_to_index() is called with overlapping/existing sample_ids.

I believe the issues are in fob.internal.core.utils.add_ids() so may extend to all backends. That said I am testing with mongodb backend

Dataset Prep (currently must run over direct connection to MongoDB Atlas for mongodb backend

# Create a dataset that is half-indexed
dsn = 'quickstart-test-similarity-image-8'
if fo.dataset_exists(dsn):
    fo.delete_dataset(dsn)
    
dataset = foz.load_zoo_dataset("quickstart",max_samples=8,dataset_name=dsn)
dataset.persistent = True
si = fob.compute_similarity(
   dataset[:4],
   model='clip-vit-base32-torch',
   embeddings='embs_half',
   brain_key='bk_half',
   backend='mongodb'
)
print(si.total_index_size) # returns 4

Issue 0, "last" sample_id to be added already exists

Apparent issue here, maybe use max(jj) rather than last element of jj

dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')

ids_add = ds[-2:].values('id')
ids_add.append(ds.first().id)
print(ids_add)
print(ds.values('id'))
d = 512
embs_add = np.random.rand(len(ids_add),d)
embs_add = embs_add.tolist()
si.add_to_index(embs_add, ids_add) # errors IndexError: index 4 is out of bounds for axis 0 with size 4

For reference, docstring from SimilarityIndex.add_to_index

            overwrite (True): whether to replace (True) or ignore (False)
                existing embeddings with the same sample/label IDs
            allow_existing (True): whether to ignore (True) or raise an error
                (False) when ``overwrite`` is False and a provided ID already
                exists in the
            warn_existing (False): whether to log a warning if an embedding is
                not added to the index because its ID already exists

Issue 1, overwrite=False, allow_existing=False. Expect error here

dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = ds[:2].values('id')
print(set(si.sample_ids).intersection(ids_add)) # shows two sample ids
embs_add = ds.select(ids_add).values('embs_half')
si.add_to_index(embs_add, ids_add, overwrite=False, allow_existing=False) # this works/succeeds silently

Issue 2, overwrite=False, warn_existing=True. Expect warning here


dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = ds[:2].values('id')
print(set(si.sample_ids).intersection(ids_add))  # shows two sample ids
embs_add = ds.select(ids_add).values('embs_half')
si.add_to_index(embs_add, ids_add, overwrite=False, warn_existing=True) # works/succeeds silently

Issue 3, overwrite=True, allow_existing=False. Don't expect error here as overwrite=True

dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = [ds.first().id, ds.last().id]
print(set(si.sample_ids).intersection(ids_add))
embs_add = ds.select(ids_add).values('embs_half')
si.add_to_index(embs_add, ids_add, overwrite=True, allow_existing=False) # errors
@allenleetc allenleetc self-assigned this Jan 9, 2024
@allenleetc allenleetc added the bug Bug fixes label Jan 9, 2024
@allenleetc
Copy link
Contributor Author

Hey @brimoor here are some cases of what I mentioned in Slack

@brimoor
Copy link
Contributor

brimoor commented Jan 9, 2024

@allenleetc thanks for the examples 👍

Some initial reactions:

  • issue 0: aren't some of the embeddings you're passing in None? If I follow exactly the steps you're showing you never computed embeddings for the second half of ds. Not sure that any of the backends were designed to handle None embeddings
  • clarification: overwrite=True/False deals with embeddings, while allow_existing=True/False deals with IDs. Perhaps it would help to clarify that overwrite=True/False only has an effect when allow_existing=True? When allow_existing=False there will always be an error when existing IDs are found. When allow_existing=True, overwrite=True means to update existing embeddings in-place while overwrite=False means to skip existing embeddings

@brimoor
Copy link
Contributor

brimoor commented Jan 9, 2024

update: what I said about overwrite and allow_existing may not be right. Need to think more carefully about it

@allenleetc
Copy link
Contributor Author

allenleetc commented Jan 9, 2024

@brimoor
Re issue 0, Oh yes that's true but if I make up some non-None embeddings to use the same error occurs. Example updated to use random embeddings (non-None)

For mongodb.py at least the issue I believe is in fbu.add_ids which occurs before consideration of the actual embeddings

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug fixes
Projects
None yet
Development

No branches or pull requests

2 participants