
[Bug]: SentenceTransformersFinetuneEngine has a problem with mixed positive and negative samples. #17413

Open
OKC13 opened this issue Jan 3, 2025 · 3 comments
Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments


OKC13 commented Jan 3, 2025

Bug Description

My training code is as follows:

# Imports assumed to come from llama_index.finetuning (adjust to your installed version)
from llama_index.finetuning import (
    EmbeddingQAFinetuneDataset,
    SentenceTransformersFinetuneEngine,
)

train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
embed_model_name = "/llm/pretrain_models/conan-embedding-v1/"
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id=embed_model_name,
    model_output_path="conan-embedding-v1-5star",
    val_dataset=None,
    epochs=3,
)
finetune_engine.finetune()

The training data is generally as follows:

{
    "queries": {
        "qid1": "q1",
        "qid2": "q2",
        "qid3": "q3"
    },
    "corpus": {
        "corpus_id1": "corpus1",
        "corpus_id2": "corpus2"
    },
    "relevant_docs": {
        "qid1": [
            "corpus_id1"
        ],
        "qid2": [
            "corpus_id1"
        ],
        "qid3": [
            "corpus_id2"
        ]
    }
}

The principle of MultipleNegativesRankingLoss is that, within a batch, every document other than a query's own positive is treated as a negative sample for that query.

The framework's implementation does not account for the case where one corpus entry is relevant to multiple queries, which can produce incorrect training signals. For example, after the above data is processed by the following code:

# Inside SentenceTransformersFinetuneEngine, each query is paired with only
# its first relevant document:
for query_id, query in dataset.queries.items():
    node_id = dataset.relevant_docs[query_id][0]
    text = dataset.corpus[node_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

It becomes:
[(q1, corpus1), (q2, corpus1), (q3, corpus2)]
Then, after passing through DataLoader(examples, batch_size=batch_size), if these pairs land in the same batch:
(q1, corpus1) is the positive pair for q1, but (q2, corpus1) will be treated as a negative pair. In fact, (q2, corpus1) is also a true positive, because q2 was generated from corpus1 by the LLM.
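
To make the collision concrete, here is a small, purely illustrative check (the pair list and batch size are assumptions taken from the example above, not framework code) that flags batches in which the same positive text appears more than once and would therefore be scored as a false negative by MultipleNegativesRankingLoss:

# Illustrative sketch: detect batches where one positive text serves several queries.
pairs = [("q1", "corpus1"), ("q2", "corpus1"), ("q3", "corpus2")]
batch_size = 3

for start in range(0, len(pairs), batch_size):
    batch = pairs[start:start + batch_size]
    positives = [text for _, text in batch]
    duplicated = {t for t in positives if positives.count(t) > 1}
    if duplicated:
        # q1 and q2 both point at corpus1, so corpus1 is wrongly used as an
        # in-batch negative for one of them.
        print("possible false negatives for:", duplicated)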


Version

latest

Steps to Reproduce

just finetune

Relevant Logs/Tracebacks

No response


dosubot bot commented Jan 3, 2025

The issue you're encountering seems to stem from the way the MultipleNegativesRankingLoss function handles positive and negative samples within a batch. In your dataset, a single corpus is associated with multiple queries, which can lead to incorrect labeling of positive and negative samples when they are included in the same batch.

The current implementation pairs each query with its relevant document, but when these pairs are batched together, the framework may not correctly identify all positive pairs, treating some as negative due to the batch processing logic.

To address this, you might consider implementing a custom batching strategy that ensures all positive pairs are correctly identified within a batch. This could involve modifying the DataLoader to group examples in a way that respects the multiple associations between queries and the same corpus. Unfortunately, the current framework does not provide a built-in mechanism to handle this specific scenario, so a custom solution would be necessary.
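
One possible shape for such a custom batching strategy (a sketch only, not part of llama-index; the helper name build_deduplicated_batches is illustrative) is to group the InputExamples greedily so that no positive text repeats within a batch:

def build_deduplicated_batches(examples, batch_size):
    """Group InputExamples so each batch contains a given positive text at most once."""
    batches, remaining = [], list(examples)
    while remaining:
        batch, seen, leftover = [], set(), []
        for ex in remaining:
            positive = ex.texts[1]  # texts = [query, relevant corpus text]
            if positive not in seen and len(batch) < batch_size:
                batch.append(ex)
                seen.add(positive)
            else:
                leftover.append(ex)
        batches.append(batch)
        remaining = leftover
    return batches

Each resulting batch can then be fed to the loss as-is (for example via a DataLoader built from these groups). If you control the trainer directly, recent sentence-transformers v3 releases also ship a no-duplicates batch sampling option (BatchSamplers.NO_DUPLICATES) that serves the same purpose.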

If you need further assistance with implementing such a solution, feel free to ask!



OKC13 commented Jan 3, 2025

class MultipleNegativesRankingLoss(nn.Module):
    def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_fct=util.cos_sim):
        """
        This loss expects as input a batch consisting of sentence pairs (a_1, p_1), (a_2, p_2)..., (a_n, p_n)
        where we assume that (a_i, p_i) are a positive pair and (a_i, p_j) for i != j a negative pair.

        For each ``a_i``, it uses all other ``p_j`` as negative samples, i.e., for ``a_i``, we have 1 positive example
        (``p_i``) and ``n-1`` negative examples (``p_j``). It then minimizes the negative log-likelihood for softmax
        normalized scores.
        """
logan-markewich (Collaborator) commented

@OKC13 feel free to make a PR! (Although tbh, it seems like it would be easier to finetune this specific scenario outside of llama-index?)
