Description:
I encountered an HTTP 403 error when attempting to create a BLEU object using the flores200 tokenizer in SacreBLEU.
Steps to Reproduce:
Create a BLEU object with tokenize="flores200".
Run the script.
Error Message:
File "/Users/williamtan/Projects/indonesiaku-benchmarking/benchmark.py", line 33, in __init__
"bleu": BLEU(tokenize="flores200"),
^^^^^^^^^^^^^^^^^^^^^^^^^^
...
File "/Users/williamtan/miniconda3/envs/ai_scientist/lib/python3.11/site-packages/sacrebleu/utils.py", line 430, in download_file
with urllib.request.urlopen(source_path) as f, open(dest_path, 'wb') as out:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib.error.HTTPError: HTTP Error 403: Forbidden
Cause:
The flores200 tokenizer model is fetched through a tinyurl link, and the request made by urllib.request.urlopen to that link is rejected with HTTP 403 Forbidden.
Proposed Solution:
To resolve this issue, update the flores200 URL in /tokenizers/tokenizer_spm.py:
I checked it now (sacrebleu -tok flores200 example.ref < example.trans) and everything works fine for me: SacreBLEU downloads the spm model automatically via the tinyurl link. In stderr logs, I see
sacreBLEU: Downloading https://tinyurl.com/flores200sacrebleuspm to /home/martin/.sacrebleu/models/flores200sacrebleuspm
I confirm it works (and downloads the same file) even if I substitute https://tinyurl.com/flores200sacrebleuspm with https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/flores200_sacrebleu_tokenizer_spm.model.
The question is which URL is more stable. The official Flores200 README uses https://tinyurl.com/flores200sacrebleuspm, so I guess that is meant as the permanent link, while its target may change (and the fbaipublicfiles.com link may stop working in the future).
I am thus a bit reluctant to change the URL in tokenizers/tokenizer_spm.py. That said, if either
1. there are more users who cannot use the tinyurl link, or
2. you provide some evidence that the fbaipublicfiles.com link can be considered permanent,
then please make a PR and I will accept it.
(In case 1, we will have to update the URL each time the tinyurl alias changes its target.)
@williammtan Can you try downloading https://tinyurl.com/flores200sacrebleuspm once again (both with urllib.request.urlopen and another method e.g. wget/curl)?
Can you try that with another tinyurl link?
Maybe you are behind a firewall which blocks the whole tinyurl.com.
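To make the urllib.request.urlopen check above easier to run, here is a minimal diagnostic sketch. The helper name `probe` and the idea of comparing against a browser-like `User-Agent` header are assumptions for illustration only; nothing in this thread confirms that the user agent is what triggers the 403.

```python
import urllib.error
import urllib.request

TINYURL = "https://tinyurl.com/flores200sacrebleuspm"


def probe(url, opener=urllib.request.urlopen, headers=None):
    """Return the HTTP status for `url`, or the error code on an HTTPError.

    `probe` is a hypothetical helper, not part of sacrebleu. `opener` is
    injectable so the function can be exercised without network access.
    """
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403.
        return err.code
```

Comparing `probe(TINYURL)` with `probe(TINYURL, headers={"User-Agent": "Mozilla/5.0"})` may help tell a block on the default urllib client apart from a block on the whole host. A firewall that drops the connection would surface as an uncaught `urllib.error.URLError` instead of a status code.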
Yet another alternative would be to catch the exception when the download fails, use e.g. https://unshorten.it/ to get the target URL, and try to download that instead, but I don't much like such a solution as it adds code unrelated to sacrebleu.
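For illustration, a simpler variant of the catch-and-retry idea could look like the sketch below. It does not resolve the alias through an unshortening service; it just tries a fixed list of the two URLs mentioned in this thread, in order. The function name `download_first_available` and the `opener` parameter are hypothetical, and this is not sacrebleu's actual `download_file` implementation.

```python
import shutil
import urllib.error
import urllib.request

# Both URLs come from this thread; whether the second stays valid is unknown.
CANDIDATE_URLS = [
    "https://tinyurl.com/flores200sacrebleuspm",
    "https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/"
    "flores200_sacrebleu_tokenizer_spm.model",
]


def download_first_available(urls, dest_path, opener=urllib.request.urlopen):
    """Try each URL in order and save the first successful response.

    Returns the URL that worked; raises RuntimeError if every URL fails.
    """
    errors = []
    for url in urls:
        try:
            # The destination file is opened only after the request succeeds.
            with opener(url) as response, open(dest_path, "wb") as out:
                shutil.copyfileobj(response, out)
            return url
        except urllib.error.URLError as err:  # HTTPError is a subclass
            errors.append(f"{url}: {err}")
    raise RuntimeError("all downloads failed:\n" + "\n".join(errors))
```

The trade-off is the one already noted above: a hard-coded fallback target has to be updated by hand whenever the tinyurl alias is pointed somewhere else.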
Thank you for your help!