
[Bug]: SentenceTransformersFinetuneEngine has a problem with mixed positive and negative samples. #17413

Open
OKC13 opened this issue Jan 3, 2025 · 3 comments
Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments


OKC13 commented Jan 3, 2025

Bug Description

My training code is as follows:

# Imports assumed to come from llama_index.finetuning (adjust to your installed version)
from llama_index.finetuning import (
    EmbeddingQAFinetuneDataset,
    SentenceTransformersFinetuneEngine,
)

train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
embed_model_name = "/llm/pretrain_models/conan-embedding-v1/"
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id=embed_model_name,
    model_output_path="conan-embedding-v1-5star",
    val_dataset=None,
    epochs=3,
)
finetune_engine.finetune()

The training data is generally as follows:

{
    "queries": {
        "qid1": "q1",
        "qid2": "q2",
        "qid3": "q3"
    },
    "corpus": {
        "corpus_id1": "corpus1",
        "corpus_id2": "corpus2"
    },
    "relevant_docs": {
        "qid1": [
            "corpus_id1"
        ],
        "qid2": [
            "corpus_id1"
        ],
        "qid3": [
            "corpus_id2"
        ]
    }
}

The principle of MultipleNegativesRankingLoss is that, within a batch, every document other than a query's own positive is treated as a negative sample for that query.

The framework's implementation does not account for the case where one corpus entry is relevant to multiple queries, which can produce incorrect training signals. For example, after the above data is processed by the following code:

# Inside SentenceTransformersFinetuneEngine, each query is paired with only
# its first relevant document:
for query_id, query in dataset.queries.items():
    node_id = dataset.relevant_docs[query_id][0]
    text = dataset.corpus[node_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

It becomes:
[(q1, corpus1), (q2, corpus1), (q3, corpus2)]
Then, after passing through DataLoader(examples, batch_size=batch_size), if these pairs land in the same batch:
(q1, corpus1) is the positive pair for q1, but (q2, corpus1) will be treated as a negative pair. In fact, (q2, corpus1) is also a true positive, because q2 was generated from corpus1 by the LLM.
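
To make the collision concrete, here is a small, purely illustrative check (the pair list and batch size are assumptions taken from the example above, not framework code) that flags batches in which the same positive text appears more than once and would therefore be scored as a false negative by MultipleNegativesRankingLoss:

# Illustrative sketch: detect batches where one positive text serves several queries.
pairs = [("q1", "corpus1"), ("q2", "corpus1"), ("q3", "corpus2")]
batch_size = 3

for start in range(0, len(pairs), batch_size):
    batch = pairs[start:start + batch_size]
    positives = [text for _, text in batch]
    duplicated = {t for t in positives if positives.count(t) > 1}
    if duplicated:
        # q1 and q2 both point at corpus1, so corpus1 is wrongly used as an
        # in-batch negative for one of them.
        print("possible false negatives for:", duplicated)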


Version

latest

Steps to Reproduce

just finetune

Relevant Logs/Tracebacks

No response


dosubot bot commented Jan 3, 2025

The issue you're encountering seems to stem from the way the MultipleNegativesRankingLoss function handles positive and negative samples within a batch. In your dataset, a single corpus is associated with multiple queries, which can lead to incorrect labeling of positive and negative samples when they are included in the same batch.

The current implementation pairs each query with its relevant document, but when these pairs are batched together, the framework may not correctly identify all positive pairs, treating some as negative due to the batch processing logic.

To address this, you might consider implementing a custom batching strategy that ensures all positive pairs are correctly identified within a batch. This could involve modifying the DataLoader to group examples in a way that respects the multiple associations between queries and the same corpus. Unfortunately, the current framework does not provide a built-in mechanism to handle this specific scenario, so a custom solution would be necessary.
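
One possible shape for such a custom batching strategy (a sketch only, not part of llama-index; the helper name build_deduplicated_batches is illustrative) is to group the InputExamples greedily so that no positive text repeats within a batch:

def build_deduplicated_batches(examples, batch_size):
    """Group InputExamples so each batch contains a given positive text at most once."""
    batches, remaining = [], list(examples)
    while remaining:
        batch, seen, leftover = [], set(), []
        for ex in remaining:
            positive = ex.texts[1]  # texts = [query, relevant corpus text]
            if positive not in seen and len(batch) < batch_size:
                batch.append(ex)
                seen.add(positive)
            else:
                leftover.append(ex)
        batches.append(batch)
        remaining = leftover
    return batches

Each resulting batch can then be fed to the loss as-is (for example via a DataLoader built from these groups). If you control the trainer directly, recent sentence-transformers v3 releases also ship a no-duplicates batch sampling option (BatchSamplers.NO_DUPLICATES) that serves the same purpose.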

If you need further assistance with implementing such a solution, feel free to ask!



OKC13 commented Jan 3, 2025

class MultipleNegativesRankingLoss(nn.Module):
    def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_fct=util.cos_sim):
        """
        This loss expects as input a batch consisting of sentence pairs (a_1, p_1), (a_2, p_2)..., (a_n, p_n)
        where we assume that (a_i, p_i) are a positive pair and (a_i, p_j) for i != j a negative pair.

        For each ``a_i``, it uses all other ``p_j`` as negative samples, i.e., for ``a_i``, we have 1 positive example
        (``p_i``) and ``n-1`` negative examples (``p_j``). It then minimizes the negative log-likelihood for softmax
        normalized scores.
        """
logan-markewich (Collaborator) commented

@OKC13 feel free to make a PR! (Although tbh, it seems like it would be easier to finetune this specific scenario outside of llama-index?)
