Bug Description

The principle of MultipleNegativesRankingLoss is that, within a batch, every sample except the one paired with a query's own positive document is treated as a negative. The implementation in this framework does not account for the case where a single corpus document corresponds to multiple queries, which can produce incorrectly labeled training batches. For example, the training data is generally as follows:
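A minimal sketch of such a dataset, in the shape of llama_index's EmbeddingQAFinetuneDataset (queries, corpus, relevant_docs); the IDs and texts here are hypothetical, but the structure matches the pairs shown further down:

```python
# Hypothetical data: q1 and q2 were both generated from corpus1 by the LLM.
queries = {"q1": "What is X?", "q2": "How does X work?", "q3": "What is Y?"}
corpus = {"c1": "corpus1: a passage about X ...", "c2": "corpus2: a passage about Y ..."}
relevant_docs = {"q1": ["c1"], "q2": ["c1"], "q3": ["c2"]}
```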
After processing the above data with the following code:
```python
from sentence_transformers import InputExample

examples = []
for query_id, query in dataset.queries.items():
    node_id = dataset.relevant_docs[query_id][0]  # only the first relevant doc is used
    text = dataset.corpus[node_id]
    example = InputExample(texts=[query, text])
    examples.append(example)
```
It becomes:
[(q1, corpus1), (q2, corpus1), (q3, corpus2)]
Then, after passing through DataLoader(examples, batch_size=batch_size), if these pairs are allocated to the same batch, we find that (q1, corpus1) is the positive pair for q1, while the copy of corpus1 paired with q2 is treated as a negative for q1. In fact, (q2, corpus1) is also a positive association, because q1 and q2 were both generated from corpus1 by the LLM.
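To see the effect concretely, here is a minimal sketch of what the loss computes for such a batch (made-up embeddings; the scaled-similarity-plus-cross-entropy structure and the labels = arange assumption mirror the loss source quoted below):

```python
import torch
import torch.nn.functional as F

# Three pairs in one batch: (q1, c1), (q2, c1), (q3, c2) -- c1 appears twice.
q = F.normalize(torch.randn(3, 8), dim=1)        # query embeddings q1, q2, q3
c1, c2 = F.normalize(torch.randn(2, 8), dim=1)   # document embeddings
docs = torch.stack([c1, c1, c2])                 # columns 0 and 1 hold the SAME text

scores = q @ docs.T * 20.0        # scaled cosine similarities (scale=20.0)
labels = torch.arange(3)          # assumes the only positive for q_i is column i
loss = F.cross_entropy(scores, labels)
# For q1 (row 0), column 1 is penalized as a negative even though it is c1,
# q1's own positive text; likewise column 0 for q2. Identical texts get pushed apart.
```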
Version
latest
Steps to Reproduce
Just fine-tune with the code above.
Relevant Logs/Tracebacks
No response
The issue you're encountering seems to stem from the way the MultipleNegativesRankingLoss function handles positive and negative samples within a batch. In your dataset, a single corpus is associated with multiple queries, which can lead to incorrect labeling of positive and negative samples when they are included in the same batch.
The current implementation pairs each query with its relevant document, but when these pairs are batched together, the framework may not correctly identify all positive pairs, treating some as negative due to the batch processing logic.
To address this, you might consider implementing a custom batching strategy that ensures all positive pairs are correctly identified within a batch. This could involve modifying the DataLoader to group examples in a way that respects the multiple associations between queries and the same corpus. Unfortunately, the current framework does not provide a built-in mechanism to handle this specific scenario, so a custom solution would be necessary.
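As a rough illustration of such a strategy (a hypothetical helper, not an API of llama_index or sentence-transformers; trailing batches may come out smaller than batch_size), a greedy batcher that never places two pairs sharing a corpus document into the same batch could look like this:

```python
def batches_without_shared_docs(examples, doc_ids, batch_size):
    """Greedy batching: start a new batch whenever the next example's corpus
    document already occurs in the current batch, so no batch contains the
    same document twice. doc_ids[i] is the corpus node id behind examples[i]."""
    batches, current, seen = [], [], set()
    for ex, doc in zip(examples, doc_ids):
        if len(current) == batch_size or doc in seen:
            batches.append(current)
            current, seen = [], set()
        current.append(ex)
        seen.add(doc)
    if current:
        batches.append(current)
    return batches
```

Alternatively, depending on your sentence-transformers version, NoDuplicatesDataLoader (from sentence_transformers.datasets) can be used in place of the plain DataLoader; it skips examples whose texts already occur in the current batch, which covers this case whenever the duplicated corpus entries are identical strings.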
If you need further assistance with implementing such a solution, feel free to ask!
For reference, the relevant part of the loss implementation in sentence-transformers (abridged):

```python
class MultipleNegativesRankingLoss(nn.Module):
    def __init__(self, model: SentenceTransformer, scale: float = 20.0, similarity_fct=util.cos_sim):
        """
        This loss expects as input a batch consisting of sentence pairs (a_1, p_1), (a_2, p_2), ..., (a_n, p_n)
        where we assume that (a_i, p_i) are a positive pair and (a_i, p_j) for i != j a negative pair.
        For each ``a_i``, it uses all other ``p_j`` as negative samples, i.e., for ``a_i`` we have 1 positive
        example (``p_i``) and ``n-1`` negative examples (``p_j``). It then minimizes the negative log-likelihood
        of the softmax-normalized scores.
        """
        super().__init__()
        self.model = model
        self.scale = scale
        self.similarity_fct = similarity_fct
```