Implemented BLEU score, wrote unit tests and documentation for it. #1006

Open

kadamrahul18 wants to merge 3 commits into main

Conversation

@kadamrahul18 commented on Jan 9, 2025

BLEU Metric Implementation:

  1. Added a new BLEU class under sdks/python/src/opik/evaluation/metrics/heuristics/bleu.py.
  2. Implemented the BLEU algorithm to calculate scores based on n-gram precision between the generated text and a reference text.
  3. Included methods for handling both single sentences and corpus-level scoring.
  4. Implemented smoothing techniques (methods 0, 1, 2, 3 from the Chen & Cherry paper) to address zero n-gram matches.
  5. Added configuration options for n-gram order, smoothing method, and weights (a usage sketch follows this list).
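
A minimal usage sketch based on the description above (the import path and the parameter names n_grams and smoothing_method are taken from this PR and may differ from the final API):

from opik.evaluation.metrics import BLEU

metric = BLEU(n_grams=4, smoothing_method="method1")
result = metric.score(
    output="The quick brown fox jumps over the lazy dog",
    reference="The quick brown fox jumps over a lazy dog",
)
print(result.value)   # BLEU score between 0.0 and 1.0
print(result.reason)  # human-readable explanation of the score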

Unit Tests:

  1. Added comprehensive unit tests in sdks/python/tests/unit/evaluation/metrics/test_heuristics.py to validate the BLEU metric's behavior in various scenarios:
  • Exact match, partial match, and no match cases.
  • Empty candidate and reference strings.
  • Different smoothing methods.
  • Corpus-level scoring.
  • Edge cases and error handling.

Integration with Evaluation Framework:

  1. Added the BLEU class to the __all__ list in sdks/python/src/opik/evaluation/metrics/heuristics/__init__.py to make it discoverable by the evaluate function (see the sketch below).
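
Concretely, the registration is along these lines (a sketch of the heuristics __init__.py; the surrounding entries are elided):

from .bleu import BLEU
# ... existing heuristic metric imports ...

__all__ = [
    # ... existing heuristic metrics ...
    "BLEU",
]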

Documentation:

  1. Added a new documentation page for the BLEU metric (bleu.md) in the evaluation/metrics section of the documentation, detailing its purpose and usage.

Testing:

  1. Thorough unit tests have been included to cover different aspects of the BLEU metric implementation, including edge cases and different smoothing methods.
  2. All tests in the Python SDK, including the new tests for the BLEU metric, pass successfully when running pytest tests/ from the sdks/python directory.
  3. pre-commit run --all-files has been executed successfully from the sdks/python directory, ensuring code style and formatting consistency.

Request for Review:

Please review the following aspects of this pull request:

  1. Correctness of the BLEU metric implementation, including n-gram precision, brevity penalty, and smoothing.
  2. Clarity and completeness of the unit tests.
  3. Thoroughness of the documentation.
  4. Adherence to Opik's coding standards and best practices.

Any feedback or suggestions for improvement are greatly appreciated.

@kadamrahul18 requested review from a team as code owners on January 9, 2025, 04:11
@alexkuzmik (Collaborator)

Hi @kadamrahul18!
I can see that the code is based on the nltk library implementation, which is one of the most popular libraries when it comes to BLEU score calculation.
I'd prefer to just use NLTK here as well. It's likely not the last heuristic metric we will add, and I don't think populating the code base with non-trivial mathematical calculations is the right thing to do when there are already stable, specialized tools for that.
What I suggest is something like this:

try:
    import nltk  # we won't add nltk as a package dependency, but we can add it to a separate requirements file for unit tests
except ImportError:
    nltk = None

...
class BLEU:
    def __init__(...):
        if nltk is None:
            raise ImportError("`nltk` library is required for BLEU score calculation, please install it via `pip install nltk`")

Under the hood of the metric implementation you can use nltk.translate.bleu_score.sentence_bleu or nltk.translate.bleu_score.corpus_bleu.

That way we'll be able to have a stable implementation and avoid a big chunk of mathematical code (which is almost always hard to read and easy to break :) )
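
For reference, both nltk entry points mentioned above expect pre-tokenized input; a small illustration (the example sentences are made up):

from nltk.translate import bleu_score

hypothesis = "the quick brown fox jumps over the lazy dog".split()
references = ["the quick brown fox jumps over a lazy dog".split()]

smoother = bleu_score.SmoothingFunction()
sentence_score = bleu_score.sentence_bleu(
    references, hypothesis, smoothing_function=smoother.method1
)

# corpus_bleu takes a list of reference-lists and a list of tokenized hypotheses
corpus_score = bleu_score.corpus_bleu([references], [hypothesis])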

@kadamrahul18 (Author)

Hi @alexkuzmik, thanks for the feedback and suggestion! I've updated the code to use nltk's implementation instead of a custom one.

Please take a look at the updated code and let me know if you have any further suggestions or if anything is unclear. I appreciate your thorough review and helpful recommendations!

@alexkuzmik (Collaborator) left a comment

Nice, thank you @kadamrahul18!
After delegating most of the algorithm part to nltk it's now significantly easier to review the PR :)
I left my comments.

###########################################################################
# CORPUS-LEVEL BLEU
###########################################################################
def score_corpus(
@alexkuzmik (Collaborator) commented on Jan 10, 2025

A metric class should only have a score method, because it is a required part of the class API
(the score method is called under the hood of evaluate pipelines, for example).

A score_corpus function can only be invoked when the metric is used manually, which is inconsistent with what Opik is trying to achieve with its metrics library.

I suggest implementing SentenceBLEU and CorpusBLEU metrics with just a score method (all inside bleu.py).
That way it will be possible to use each of them in both manual and automatic evaluation flows, which is essential for Opik's evaluation approach.
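
A rough sketch of that split (the base_metric.BaseMetric and score_result.ScoreResult usage mirrors the pattern of other Opik metrics; constructor signatures and the whitespace tokenization are assumptions):

from typing import Any, List, Union

from nltk.translate import bleu_score as nltk_bleu_score

from opik.evaluation.metrics import base_metric, score_result


class SentenceBLEU(base_metric.BaseMetric):
    def __init__(self, name: str = "sentence_bleu_metric", track: bool = True) -> None:
        super().__init__(name=name, track=track)

    def score(
        self, output: str, reference: Union[str, List[str]], **ignored_kwargs: Any
    ) -> score_result.ScoreResult:
        references = [reference] if isinstance(reference, str) else reference
        value = nltk_bleu_score.sentence_bleu(
            [ref.split() for ref in references], output.split()
        )
        return score_result.ScoreResult(value=value, name=self.name)


class CorpusBLEU(base_metric.BaseMetric):
    def __init__(self, name: str = "corpus_bleu_metric", track: bool = True) -> None:
        super().__init__(name=name, track=track)

    def score(
        self, output: List[str], reference: List[List[str]], **ignored_kwargs: Any
    ) -> score_result.ScoreResult:
        # corpus_bleu expects one list of tokenized references per candidate
        value = nltk_bleu_score.corpus_bleu(
            [[ref.split() for ref in refs] for refs in reference],
            [candidate.split() for candidate in output],
        )
        return score_result.ScoreResult(value=value, name=self.name)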

reason=f"Sentence-level BLEU (nltk, method={self.smoothing_method}): {bleu_value:.4f}",
)

###########################################################################
@alexkuzmik (Collaborator)

No need to have these comments.

"""
return getattr(self._nltk_smoother, self.smoothing_method, self._nltk_smoother.method0)

def _truncate_weights(self, candidate_len: int) -> tuple:
@alexkuzmik (Collaborator)

return type should be more precise, e.g. Tuple[float, ...]
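
For instance, the annotated signature could read (body elided):

from typing import Tuple

def _truncate_weights(self, candidate_len: int) -> Tuple[float, ...]:
    ...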

:param name: Name for this metric instance.
:param track: Whether or not this metric is tracked (depends on your system).
:param n_grams: Up to which n-gram order to use (1 through n_grams).
:param smoothing_method: One of NLTK's SmoothingFunction methods (e.g., "method0", "method1", "method2", etc.).
@alexkuzmik (Collaborator)

Let's add a link in the docstring to https://www.nltk.org/api/nltk.translate.bleu_score.html#nltk.translate.bleu_score.SmoothingFunction
so that people can understand those options more easily.


try:
    import nltk
    from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction
@alexkuzmik (Collaborator)

In order to keep the namespace clean and readable, we import modules, not names.
So this code needs to be turned into something like:

from nltk.translate import bleu_score as nltk_bleu_score

# and then later use the required API like that
nltk_bleu_score.SmoothingFunction
nltk_bleu_score.sentence_bleu
nltk_bleu_score.corpus_bleu

It is usually significantly easier to read the code when it explicitly shows where the function or class comes from :)

)

# Determine the largest candidate length
max_len = max(len(c) for c in all_candidates)
@alexkuzmik (Collaborator)

c -> candidate

@@ -1,9 +1,11 @@
import pytest
@alexkuzmik (Collaborator)

Don't forget to add nltk to tests/unit/test_requirements.txt, or these unit tests won't work in our CI.


self._nltk_smoother = SmoothingFunction()

def _get_smoothing_func(self):
@alexkuzmik (Collaborator)

return type is missing

self, output: str, reference: Union[str, List[str]], **ignored_kwargs: Any
) -> score_result.ScoreResult:
"""
Computes a single-sentence BLEU score using nltk.translate.bleu_score.sentence_bleu.
@alexkuzmik (Collaborator)

It's a public API method; please follow the same docstring format as in our other metrics.

reason="Mismatch: number of candidates != number of references.",
)

all_candidates = []
@alexkuzmik (Collaborator)

Please add type hints to these lists. They would be very helpful for understanding the logic below, because some values can be lists of strings and some are just strings.
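
For instance (the second list name and the element types are assumptions based on how the corpus-level nltk call consumes them):

from typing import List

all_candidates: List[List[str]] = []            # one token list per candidate
all_reference_sets: List[List[List[str]]] = []  # per candidate, a list of tokenized references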
