Misc `SimilarityIndex.add_to_index()` issues #169

allenleetc · 2024-01-09T16:19:48Z

Minor issues for similarity indexes when add_to_index() is called with overlapping/existing sample_ids.

I believe the issues are in fob.internal.core.utils.add_ids() so may extend to all backends. That said I am testing with mongodb backend

Dataset Prep (currently must run over direct connection to MongoDB Atlas for `mongodb` backend

# Create a dataset that is half-indexed
dsn = 'quickstart-test-similarity-image-8'
if fo.dataset_exists(dsn):
    fo.delete_dataset(dsn)
    
dataset = foz.load_zoo_dataset("quickstart",max_samples=8,dataset_name=dsn)
dataset.persistent = True
si = fob.compute_similarity(
   dataset[:4],
   model='clip-vit-base32-torch',
   embeddings='embs_half',
   brain_key='bk_half',
   backend='mongodb'
)
print(si.total_index_size) # returns 4

Issue 0, "last" sample_id to be added already exists

Apparent issue here, maybe use max(jj) rather than last element of jj

fiftyone-brain/fiftyone/brain/internal/core/utils.py

Line 484 in 62da0a6

m = jj[-1] - n + 1

dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')

ids_add = ds[-2:].values('id')
ids_add.append(ds.first().id)
print(ids_add)
print(ds.values('id'))
d = 512
embs_add = np.random.rand(len(ids_add),d)
embs_add = embs_add.tolist()
si.add_to_index(embs_add, ids_add) # errors IndexError: index 4 is out of bounds for axis 0 with size 4

For reference, docstring from `SimilarityIndex.add_to_index`

            overwrite (True): whether to replace (True) or ignore (False)
                existing embeddings with the same sample/label IDs
            allow_existing (True): whether to ignore (True) or raise an error
                (False) when ``overwrite`` is False and a provided ID already
                exists in the
            warn_existing (False): whether to log a warning if an embedding is
                not added to the index because its ID already exists

Issue 1, overwrite=False, allow_existing=False. Expect error here

dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = ds[:2].values('id')
print(set(si.sample_ids).intersection(ids_add)) # shows two sample ids
embs_add = ds.select(ids_add).values('embs_half')
si.add_to_index(embs_add, ids_add, overwrite=False, allow_existing=False) # this works/succeeds silently

Issue 2, overwrite=False, warn_existing=True. Expect warning here


dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = ds[:2].values('id')
print(set(si.sample_ids).intersection(ids_add))  # shows two sample ids
embs_add = ds.select(ids_add).values('embs_half')
si.add_to_index(embs_add, ids_add, overwrite=False, warn_existing=True) # works/succeeds silently

Issue 3, overwrite=True, allow_existing=False. Don't expect error here as overwrite=True

dsn = 'quickstart-test-similarity-image-8'
ds = fo.load_dataset(dsn)
si = ds.load_brain_results('bk_half')
ids_add = [ds.first().id, ds.last().id]
print(set(si.sample_ids).intersection(ids_add))
embs_add = ds.select(ids_add).values('embs_half')
si.add_to_index(embs_add, ids_add, overwrite=True, allow_existing=False) # errors

The text was updated successfully, but these errors were encountered:

allenleetc · 2024-01-09T16:31:24Z

Hey @brimoor here are some cases of what I mentioned in Slack

brimoor · 2024-01-09T17:40:51Z

@allenleetc thanks for the examples 👍

Some initial reactions:

issue 0: aren't some of the embeddings you're passing in None? If I follow exactly the steps you're showing you never computed embeddings for the second half of ds. Not sure that any of the backends were designed to handle None embeddings
clarification: overwrite=True/False deals with embeddings, while allow_existing=True/False deals with IDs. Perhaps it would help to clarify that overwrite=True/False only has an effect when allow_existing=True? When allow_existing=False there will always be an error when existing IDs are found. When allow_existing=True, overwrite=True means to update existing embeddings in-place while overwrite=False means to skip existing embeddings

brimoor · 2024-01-09T17:54:27Z

update: what I said about overwrite and allow_existing may not be right. Need to think more carefully about it

allenleetc · 2024-01-09T17:54:39Z

@brimoor
Re issue 0, Oh yes that's true but if I make up some non-None embeddings to use the same error occurs. Example updated to use random embeddings (non-None)

For mongodb.py at least the issue I believe is in fbu.add_ids which occurs before consideration of the actual embeddings

allenleetc self-assigned this Jan 9, 2024

allenleetc added the bug Bug fixes label Jan 9, 2024

allenleetc mentioned this issue Jan 11, 2024

Fixing similarity index bugs #171

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Misc `SimilarityIndex.add_to_index()` issues #169

Misc `SimilarityIndex.add_to_index()` issues #169

allenleetc commented Jan 9, 2024 •

edited

Loading

allenleetc commented Jan 9, 2024

brimoor commented Jan 9, 2024 •

edited

Loading

brimoor commented Jan 9, 2024 •

edited

Loading

allenleetc commented Jan 9, 2024 •

edited

Loading

Misc SimilarityIndex.add_to_index() issues #169

Misc SimilarityIndex.add_to_index() issues #169

Comments

allenleetc commented Jan 9, 2024 • edited Loading

Dataset Prep (currently must run over direct connection to MongoDB Atlas for mongodb backend

Issue 0, "last" sample_id to be added already exists

For reference, docstring from SimilarityIndex.add_to_index

Issue 1, overwrite=False, allow_existing=False. Expect error here

Issue 2, overwrite=False, warn_existing=True. Expect warning here

Issue 3, overwrite=True, allow_existing=False. Don't expect error here as overwrite=True

allenleetc commented Jan 9, 2024

brimoor commented Jan 9, 2024 • edited Loading

brimoor commented Jan 9, 2024 • edited Loading

allenleetc commented Jan 9, 2024 • edited Loading

Misc `SimilarityIndex.add_to_index()` issues #169

Misc `SimilarityIndex.add_to_index()` issues #169

allenleetc commented Jan 9, 2024 •

edited

Loading

Dataset Prep (currently must run over direct connection to MongoDB Atlas for `mongodb` backend

For reference, docstring from `SimilarityIndex.add_to_index`

brimoor commented Jan 9, 2024 •

edited

Loading

brimoor commented Jan 9, 2024 •

edited

Loading

allenleetc commented Jan 9, 2024 •

edited

Loading