Implemented BLEU score, wrote unit tests and documentation for it. #1006

Open

kadamrahul18 wants to merge 3 commits into main

Conversation

@kadamrahul18 commented on Jan 9, 2025

BLEU Metric Implementation:

  1. Added a new BLEU class under sdks/python/src/opik/evaluation/metrics/heuristics/bleu.py.
  2. Implemented the BLEU algorithm to calculate scores based on n-gram precision between the generated text and a reference text.
  3. Included methods for handling both single sentences and corpus-level scoring.
  4. Implemented smoothing techniques (methods 0, 1, 2, 3 from the Chen & Cherry paper) to address zero n-gram matches.
  5. Added configuration options for n-gram order, smoothing method, and weights (a usage sketch follows this list).
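
A minimal usage sketch based on the description above (the import path and the parameter names n_grams and smoothing_method are taken from this PR and may differ from the final API):

from opik.evaluation.metrics import BLEU

metric = BLEU(n_grams=4, smoothing_method="method1")
result = metric.score(
    output="The quick brown fox jumps over the lazy dog",
    reference="The quick brown fox jumps over a lazy dog",
)
print(result.value)   # BLEU score between 0.0 and 1.0
print(result.reason)  # human-readable explanation of the score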

Unit Tests:

  1. Added comprehensive unit tests in sdks/python/tests/unit/evaluation/metrics/test_heuristics.py to validate the BLEU metric's behavior in various scenarios:
  • Exact match, partial match, and no match cases.
  • Empty candidate and reference strings.
  • Different smoothing methods.
  • Corpus-level scoring.
  • Edge cases and error handling.

Integration with Evaluation Framework:

  1. Added the BLEU class to the __all__ list in sdks/python/src/opik/evaluation/metrics/heuristics/__init__.py to make it discoverable by the evaluate function (see the sketch below).
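
Concretely, the registration is along these lines (a sketch of the heuristics __init__.py; the surrounding entries are elided):

from .bleu import BLEU
# ... existing heuristic metric imports ...

__all__ = [
    # ... existing heuristic metrics ...
    "BLEU",
]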

Documentation:

  1. Added a new documentation page for the BLEU metric (bleu.md) in the evaluation/metrics section of the documentation, detailing its purpose and usage.

Testing:

  1. Thorough unit tests have been included to cover different aspects of the BLEU metric implementation, including edge cases and different smoothing methods.
  2. All tests in the Python SDK, including the new tests for the BLEU metric, pass successfully when running pytest tests/ from the sdks/python directory.
  3. pre-commit run --all-files has been executed successfully from the sdks/python directory, ensuring code style and formatting consistency.

Request for Review:

Please review the following aspects of this pull request:

  1. Correctness of the BLEU metric implementation, including n-gram precision, brevity penalty, and smoothing.
  2. Clarity and completeness of the unit tests.
  3. Thoroughness of the documentation.
  4. Adherence to Opik's coding standards and best practices.

Any feedback or suggestions for improvement are greatly appreciated.

@kadamrahul18 requested review from a team as code owners on January 9, 2025, 04:11
@alexkuzmik (Collaborator)

Hi @kadamrahul18!
I can see that the code is based on the nltk library implementation, which is one of the most popular libraries when it comes to BLEU score calculation.
I'd prefer to just use NLTK here as well. It's likely not the last heuristic metric we will add, and I don't think populating the code base with non-trivial mathematical calculations is the right thing to do when there are already stable, specialized tools for that.
What I suggest is something like this:

try:
    import nltk  # we won't add nltk as a package dependency, but we can add it to a separate requirements file for unit tests
except ImportError:
    nltk = None

...
class BLEU:
    def __init__(...):
        if nltk is None:
            raise ImportError("`nltk` library is required for BLEU score calculation, please install it via `pip install nltk`")

Under the hood of the metric implementation you can use nltk.translate.bleu_score.sentence_bleu or nltk.translate.bleu_score.corpus_bleu.

That way we'll be able to have a stable implementation and avoid a big chunk of mathematical code (which is almost always hard to read and easy to break :) )
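
For reference, both nltk entry points mentioned above expect pre-tokenized input; a small illustration (the example sentences are made up):

from nltk.translate import bleu_score

hypothesis = "the quick brown fox jumps over the lazy dog".split()
references = ["the quick brown fox jumps over a lazy dog".split()]

smoother = bleu_score.SmoothingFunction()
sentence_score = bleu_score.sentence_bleu(
    references, hypothesis, smoothing_function=smoother.method1
)

# corpus_bleu takes a list of reference-lists and a list of tokenized hypotheses
corpus_score = bleu_score.corpus_bleu([references], [hypothesis])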

@kadamrahul18 (Author)

Hi @alexkuzmik, thanks for the feedback and suggestion! I've updated the code to use nltk's implementation instead of a custom one.

Please take a look at the updated code and let me know if you have any further suggestions or if anything is unclear. I appreciate your thorough review and helpful recommendations!

@alexkuzmik (Collaborator) left a comment

Nice, thank you @kadamrahul18!
After delegating most of the algorithm part to nltk it's now significantly easier to review the PR :)
I left my comments.

###########################################################################
# CORPUS-LEVEL BLEU
###########################################################################
def score_corpus(
@alexkuzmik (Collaborator) commented on Jan 10, 2025

A metric class should only have a score method, because it is a required part of the class API
(the score method is called under the hood of evaluate pipelines, for example).

A score_corpus function can only be invoked when the metric is used manually, which is inconsistent with what Opik is trying to achieve with its metrics library.

I suggest implementing SentenceBLEU and CorpusBLEU metrics with just a score method (all inside bleu.py).
That way it will be possible to use each of them in both manual and automatic evaluation flows, which is essential for Opik's evaluation approach.
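
A rough sketch of that split (the base_metric.BaseMetric and score_result.ScoreResult usage mirrors the pattern of other Opik metrics; constructor signatures and the whitespace tokenization are assumptions):

from typing import Any, List, Union

from nltk.translate import bleu_score as nltk_bleu_score

from opik.evaluation.metrics import base_metric, score_result


class SentenceBLEU(base_metric.BaseMetric):
    def __init__(self, name: str = "sentence_bleu_metric", track: bool = True) -> None:
        super().__init__(name=name, track=track)

    def score(
        self, output: str, reference: Union[str, List[str]], **ignored_kwargs: Any
    ) -> score_result.ScoreResult:
        references = [reference] if isinstance(reference, str) else reference
        value = nltk_bleu_score.sentence_bleu(
            [ref.split() for ref in references], output.split()
        )
        return score_result.ScoreResult(value=value, name=self.name)


class CorpusBLEU(base_metric.BaseMetric):
    def __init__(self, name: str = "corpus_bleu_metric", track: bool = True) -> None:
        super().__init__(name=name, track=track)

    def score(
        self, output: List[str], reference: List[List[str]], **ignored_kwargs: Any
    ) -> score_result.ScoreResult:
        # corpus_bleu expects one list of tokenized references per candidate
        value = nltk_bleu_score.corpus_bleu(
            [[ref.split() for ref in refs] for refs in reference],
            [candidate.split() for candidate in output],
        )
        return score_result.ScoreResult(value=value, name=self.name)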

reason=f"Sentence-level BLEU (nltk, method={self.smoothing_method}): {bleu_value:.4f}",
)

###########################################################################
@alexkuzmik (Collaborator)

No need to have these comments.

"""
return getattr(self._nltk_smoother, self.smoothing_method, self._nltk_smoother.method0)

def _truncate_weights(self, candidate_len: int) -> tuple:
@alexkuzmik (Collaborator)

return type should be more precise, e.g. Tuple[float, ...]
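
For instance, the annotated signature could read (body elided):

from typing import Tuple

def _truncate_weights(self, candidate_len: int) -> Tuple[float, ...]:
    ...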

:param name: Name for this metric instance.
:param track: Whether or not this metric is tracked (depends on your system).
:param n_grams: Up to which n-gram order to use (1 through n_grams).
:param smoothing_method: One of NLTK's SmoothingFunction methods (e.g., "method0", "method1", "method2", etc.).
@alexkuzmik (Collaborator)

Let's add a link in the docstring to https://www.nltk.org/api/nltk.translate.bleu_score.html#nltk.translate.bleu_score.SmoothingFunction
so that people can understand those options more easily.


try:
    import nltk
    from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction
@alexkuzmik (Collaborator)

In order to keep the namespace clean and readable, we import modules, not names.
So this code needs to be turned into something like:

from nltk.translate import bleu_score as nltk_bleu_score

# and then later use the required API like that
nltk_bleu_score.SmoothingFunction
nltk_bleu_score.sentence_bleu
nltk_bleu_score.corpus_bleu

It is usually significantly easier to read the code when it explicitly shows where the function or class comes from :)

)

# Determine the largest candidate length
max_len = max(len(c) for c in all_candidates)
@alexkuzmik (Collaborator)

c -> candidate

@@ -1,9 +1,11 @@
import pytest
@alexkuzmik (Collaborator)

Don't forget to add nltk to tests/unit/test_requirements.txt, or these unit tests won't work in our CI.


self._nltk_smoother = SmoothingFunction()

def _get_smoothing_func(self):
@alexkuzmik (Collaborator)

return type is missing

self, output: str, reference: Union[str, List[str]], **ignored_kwargs: Any
) -> score_result.ScoreResult:
"""
Computes a single-sentence BLEU score using nltk.translate.bleu_score.sentence_bleu.
@alexkuzmik (Collaborator)

It's a public API method; please follow the same docstring format as in our other metrics.

reason="Mismatch: number of candidates != number of references.",
)

all_candidates = []
@alexkuzmik (Collaborator)

Please add type hints to these lists. They would be very helpful for understanding the logic below, because some values can be lists of strings and some are just strings.
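
For instance (the second list name and the element types are assumptions based on how the corpus-level nltk call consumes them):

from typing import List

all_candidates: List[List[str]] = []            # one token list per candidate
all_reference_sets: List[List[List[str]]] = []  # per candidate, a list of tokenized references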
