Error when creating BLEU object with flores200 tokenizer: HTTP 403 Forbidden #274

Open
williammtan opened this issue Nov 6, 2024 · 2 comments

@williammtan

Description:
I encountered an HTTP 403 error when attempting to create a BLEU object using the flores200 tokenizer in SacreBLEU.

Steps to Reproduce:

  1. Create a BLEU object with tokenize="flores200".
  2. Run the script.

Error Message:

File "/Users/williamtan/Projects/indonesiaku-benchmarking/benchmark.py", line 33, in __init__
    "bleu": BLEU(tokenize="flores200"),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
  File "/Users/williamtan/miniconda3/envs/ai_scientist/lib/python3.11/site-packages/sacrebleu/utils.py", line 430, in download_file
    with urllib.request.urlopen(source_path) as f, open(dest_path, 'wb') as out:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib.error.HTTPError: HTTP Error 403: Forbidden

Cause:
The error appears to occur because the flores200 tokenizer model is downloaded through a tinyurl link, and the request made by urllib.request.urlopen to that link is rejected with HTTP 403 Forbidden.

Proposed Solution:
To resolve this issue, replace the tinyurl link in the flores200 entry of /tokenizers/tokenizer_spm.py with the direct download URL:

"flores200": {
    "url": "https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/flores200_sacrebleu_tokenizer_spm.model",
    "signature": "flores200",
}

Thank you for your help!

@martinpopel
Collaborator

I checked it just now (sacrebleu -tok flores200 example.ref < example.trans) and everything works fine for me: SacreBLEU downloads the spm model automatically via the tinyurl link. In the stderr log, I see

sacreBLEU: Downloading https://tinyurl.com/flores200sacrebleuspm to /home/martin/.sacrebleu/models/flores200sacrebleuspm

I even tried

import urllib.request
with urllib.request.urlopen("https://tinyurl.com/flores200sacrebleuspm") as f, open("flores200sacrebleuspm", 'wb') as out:
    out.write(f.read())

and it works fine.

I confirm it works (and downloads the same file) even if I substitute https://tinyurl.com/flores200sacrebleuspm with https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/flores200_sacrebleu_tokenizer_spm.model.
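Since both URLs return the same file, a user who cannot reach tinyurl.com could place the model at the cache path shown in the log line above before running SacreBLEU. The following is a minimal sketch under two assumptions that are not confirmed in this thread: that SacreBLEU reuses a non-empty file already present at that path instead of downloading it again, and that the cache lives under `~/.sacrebleu/models` on the user's machine as it does in the log. The helper names `cache_path` and `preseed_model` are hypothetical and not part of SacreBLEU.

```python
import os
import urllib.request

# Direct URL quoted in this thread.
DIRECT_URL = (
    "https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/"
    "flores200_sacrebleu_tokenizer_spm.model"
)


def cache_path(home=None):
    """Return the model path seen in the log line, relative to a home directory.

    Assumption: the cache directory is <home>/.sacrebleu/models.
    """
    home = home or os.path.expanduser("~")
    return os.path.join(home, ".sacrebleu", "models", "flores200sacrebleuspm")


def preseed_model(url=DIRECT_URL, dest=None):
    """Download the model to dest unless a non-empty file is already there."""
    dest = dest or cache_path()
    if os.path.exists(dest) and os.path.getsize(dest) > 0:
        return dest  # a usable file is already in place; no network access
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with urllib.request.urlopen(url) as f, open(dest, "wb") as out:
        out.write(f.read())
    return dest
```

Calling `preseed_model()` once and then creating `BLEU(tokenize="flores200")` would show whether the assumption about the cache holds.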

The question is which URL is more stable. The official Flores200 README uses https://tinyurl.com/flores200sacrebleuspm, so I guess that one is meant to be the permanent link, while its target may change (and the fbaipublicfiles.com link may stop working in the future).

I am thus a bit reluctant to change the URL in tokenizers/tokenizer_spm.py. That said, if either

  1. there are more users who cannot use the tinyurl link, or
  2. you provide some evidence that the fbaipublicfiles.com link can be considered permanent,

then please make a PR and I will accept it.
(In case 1, we will have to update the URL each time the tinyurl alias changes its target.)

@martinpopel
Collaborator

@williammtan Can you try downloading https://tinyurl.com/flores200sacrebleuspm once again, both with urllib.request.urlopen and with another method (e.g. wget or curl)?
Can you also try another tinyurl link?
Maybe you are behind a firewall that blocks all of tinyurl.com.
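To make that comparison easier to report, here is a small diagnostic sketch that uses only the standard library. It returns the HTTP status and the final URL after redirects, or the error code when the server refuses the request. The optional `user_agent` argument is there to test one possible explanation, namely that the default Python user agent is being rejected; that is only a guess and nothing in this thread confirms it. The function name `probe` and the `opener` parameter are hypothetical.

```python
import urllib.error
import urllib.request


def probe(url, user_agent=None, opener=urllib.request.urlopen):
    """Return (status, final_url), or (error_code, None) on an HTTP error.

    Network-level failures (urllib.error.URLError, e.g. a blocked host)
    are not caught here and propagate to the caller.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    req = urllib.request.Request(url, headers=headers)
    try:
        with opener(req) as resp:
            return resp.status, resp.geturl()
    except urllib.error.HTTPError as e:
        return e.code, None
```

Running `probe("https://tinyurl.com/flores200sacrebleuspm")` with and without a custom `user_agent`, and again on the fbaipublicfiles.com URL, would show whether the 403 is specific to tinyurl.com or to the way the request is sent.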

Yet another alternative would be to catch the exception when the download fails, use e.g. https://unshorten.it/ to get the target URL, and try to download that instead, but I don't much like that solution, as it adds code unrelated to sacrebleu.
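For illustration, a simpler variant of that idea is sketched below. Instead of calling an unshortening service, it tries a fixed list of candidate URLs in order and keeps the first one that downloads. This is not SacreBLEU code; `download_with_fallback`, `_fetch_bytes` and `CANDIDATE_URLS` are hypothetical names, and the hard-coded second URL would still need updating whenever the tinyurl alias changes its target.

```python
import urllib.error
import urllib.request

# Both URLs are quoted in this thread; the order (alias first) is an assumption.
CANDIDATE_URLS = [
    "https://tinyurl.com/flores200sacrebleuspm",
    "https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/"
    "flores200_sacrebleu_tokenizer_spm.model",
]


def _fetch_bytes(url):
    """Download url and return the response body as bytes."""
    with urllib.request.urlopen(url) as f:
        return f.read()


def download_with_fallback(urls, dest_path, fetch=_fetch_bytes):
    """Try each URL in order and write the first successful body to dest_path.

    Returns the URL that worked; raises RuntimeError if every URL fails.
    """
    errors = []
    for url in urls:
        try:
            data = fetch(url)
        except urllib.error.URLError as e:  # HTTPError is a subclass of URLError
            errors.append((url, str(e)))
            continue
        with open(dest_path, "wb") as out:
            out.write(data)
        return url
    raise RuntimeError(f"all downloads failed: {errors}")
```

A call such as `download_with_fallback(CANDIDATE_URLS, "flores200sacrebleuspm")` would return the URL that succeeded, which also tells the user which link was reachable from their network.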
