Implemented BLEU score, wrote unit tests and documentation for it. #1006
base: main
Conversation
Hi @kadamrahul18! Try:

```python
try:
    import nltk  # we won't add nltk as a package dependency, but we can add it to a separate requirements file for unit tests
except ImportError:
    nltk = None

...

class BLEU:
    def __init__(...):
        if nltk is None:
            raise ImportError("`nltk` library is required for BLEU score calculation, please install it via `pip install nltk`")
```

Under the hood of the metric implementation you can use nltk. That way we'll be able to have a stable implementation and avoid a big chunk of mathematical code (which is almost always hard to read and easy to break :) )
Hi @alexkuzmik, thanks for the feedback and suggestion! I've updated the code to use nltk's implementation instead of a custom one. Please take a look at the updated code and let me know if you have any further suggestions or if anything is unclear. I appreciate your thorough review and helpful recommendations!
Nice, thank you @kadamrahul18! After delegating most of the algorithm part to nltk, it's now significantly easier to review the PR :) I left my comments.
```python
###########################################################################
# CORPUS-LEVEL BLEU
###########################################################################
def score_corpus(
```
A metric class should only have a `score` method, because it is a required part of the class API (the `score` method is called under the hood of `evaluate` pipelines, for example). The `score_corpus` function can be invoked only if the metric is used manually, which is inconsistent with what Opik is trying to achieve with its metrics library.

I suggest implementing `SentenceBLEU` and `CorpusBLEU` metrics with just a `score` method (all inside `bleu.py`). That way it will be possible to use each of them in manual and automatic evaluation flows, which is essential for Opik's evaluation approach.
```python
reason=f"Sentence-level BLEU (nltk, method={self.smoothing_method}): {bleu_value:.4f}",
)

###########################################################################
```
No need to have these comments.
```python
"""
return getattr(self._nltk_smoother, self.smoothing_method, self._nltk_smoother.method0)

def _truncate_weights(self, candidate_len: int) -> tuple:
```
The return type should be more precise, e.g. `Tuple[float, ...]`.
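For illustration, the suggested annotation could look like this; the function body and the standalone (non-method) signature are assumptions for the sketch, since the real `_truncate_weights` lives on the metric class:

```python
from typing import Tuple


def truncate_weights(n_grams: int, candidate_len: int) -> Tuple[float, ...]:
    # Use at most `candidate_len` n-gram orders, with uniform weights,
    # so very short candidates don't get zero higher-order precisions.
    n = max(1, min(n_grams, candidate_len))
    return tuple(1.0 / n for _ in range(n))
```

`Tuple[float, ...]` tells the reader the tuple is homogeneous but variable-length, which plain `tuple` does not.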
```python
:param name: Name for this metric instance.
:param track: Whether or not this metric is tracked (depends on your system).
:param n_grams: Up to which n-gram order to use (1 through n_grams).
:param smoothing_method: One of NLTK's SmoothingFunction methods (e.g., "method0", "method1", "method2", etc.).
```
Let's add a link in the docstrings to https://www.nltk.org/api/nltk.translate.bleu_score.html#nltk.translate.bleu_score.SmoothingFunction so that people can understand those options more easily.
```python
try:
    import nltk
    from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction
```
In order to keep the namespace clean and readable, we import modules, not names. So this code needs to be turned into something like:

```python
from nltk.translate import bleu_score as nltk_bleu_score

# and then later use the required API like that
nltk_bleu_score.SmoothingFunction
nltk_bleu_score.sentence_bleu
nltk_bleu_score.corpus_bleu
```

It is usually significantly easier to read the code when it explicitly tells where the function or the class comes from :)
```python
)

# Determine the largest candidate length
max_len = max(len(c) for c in all_candidates)
```
Rename `c` to `candidate`: `max_len = max(len(candidate) for candidate in all_candidates)`.
```python
import pytest
```
Don't forget to add `nltk` to `tests/unit/test_requirements.txt`, or these unit tests won't work in our CI.
```python
self._nltk_smoother = SmoothingFunction()

def _get_smoothing_func(self):
```
The return type annotation is missing.
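One hedged way to annotate the helper, using a stand-in smoother so the sketch is self-contained (nltk's real `SmoothingFunction` exposes `method0` through `method7`, and its methods accept extra arguments, so `Callable[..., Any]` may be the more accurate annotation in the actual code):

```python
from typing import Callable, List


class FakeSmoother:
    """Stand-in for nltk_bleu_score.SmoothingFunction, used only for this sketch."""

    def method0(self, p_n: List[float]) -> List[float]:
        # No smoothing: return the modified precisions unchanged.
        return p_n


def get_smoothing_func(smoother: FakeSmoother,
                       smoothing_method: str) -> Callable[[List[float]], List[float]]:
    # Fall back to method0 when the configured method name does not exist.
    return getattr(smoother, smoothing_method, smoother.method0)
```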
```python
    self, output: str, reference: Union[str, List[str]], **ignored_kwargs: Any
) -> score_result.ScoreResult:
    """
    Computes a single-sentence BLEU score using nltk.translate.bleu_score.sentence_bleu.
```
It's a public API method, so please follow the same docstring format as in our other metrics.
```python
reason="Mismatch: number of candidates != number of references.",
)

all_candidates = []
```
Please add type hints to these lists. It would be very helpful for understanding the logic below, since some values can be lists of strings and others just strings.
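For example, assuming the shapes that nltk's `corpus_bleu` expects (tokenized candidates, and a list of tokenized reference sentences per candidate; the exact shapes used in the PR are an assumption here):

```python
from typing import List

# One tokenized sentence per candidate.
all_candidates: List[List[str]] = []
# For each candidate, a list of tokenized reference sentences.
all_references: List[List[List[str]]] = []

all_candidates.append("the cat sat".split())
all_references.append(["the cat sat".split(), "a cat was sitting".split()])
```

With the annotations in place, a reader can tell at a glance which level of nesting holds tokens versus whole sentences.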
This PR covers:

- BLEU Metric Implementation
- Unit Tests
- Integration with Evaluation Framework
- Documentation
- Testing

**Request for Review:** Please review the above aspects of this pull request. Any feedback or suggestions for improvement are greatly appreciated.