
downloader: resume interrupted downloads #99

Open
tiborsimko opened this issue Nov 19, 2020 · 3 comments

@tiborsimko
Member
When one downloads a big 20 GB file, something can go wrong along the way, or the user may put their machine to sleep or restart it. When this happens and the user re-issues the download-file command, the download starts again from zero, even though the local directory already contains part of the file.

The goal of this issue is to resume interrupted downloads from the last good state.

How to reproduce:

  1. Run local CERN Open Data instance as follows:
$ cd opendata.cern.ch
$ docker-compose build
$ docker-compose up
$ docker exec -i -t opendatacernch_web_1 ./scripts/populate-instance.sh --skip-records
$ docker exec -i -t opendatacernch_web_1 cernopendata fixtures records --mode insert-or-replace -f cernopendata/modules/fixtures/data/records/atlas-2020-exactly2lep.json
$ firefox http://localhost/record/15007

The test record contains a big 20 GB file.

Now start a download and interrupt it after a while:

$ cernopendata-client download-files --recid 15007 --server http://localhost
==> Downloading file 1 of 1
  -> File: ./15007/exactly2lep.zip
 
^C
$ ls -lh 15007/exactly2lep.zip
-rw-r--r-- 1 simko simko 15M Nov 19 11:40 15007/exactly2lep.zip

and then re-issue the download command:

$ cernopendata-client download-files --recid 15007 --server http://localhost

The downloader should recognise the already available 15007/exactly2lep.zip file and continue downloading from there.

Note that the user may interrupt the download any number of times before successful completion.

The behaviour could be configurable by a new --resume option, for example:

  • when the downloader sees that there is no target file yet, it proceeds to download as usual;
  • when the downloader sees the file, it checks its size and checksum, and
    • if the file is complete, it says that there is nothing to download, that the file is already here and verified;
    • if the file is partial, it checks whether the user passed the --resume option; if yes, it continues from that point, and if no, it asks the user whether to resume or redownload.
  • Since we mostly have big files, I guess the resume behaviour could be the default, asking the user to do rm ... on the given file if they would rather redownload than resume.
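The resume step described above could be sketched with an HTTP Range request; a minimal illustration, assuming the server supports byte ranges (the function names `resume_request` and `download_with_resume` are hypothetical, not part of cernopendata-client):

```python
import os
import urllib.request


def resume_request(url, offset):
    """Build a GET request asking the server to start at byte `offset`."""
    req = urllib.request.Request(url)
    if offset:
        req.add_header("Range", f"bytes={offset}-")
    return req


def download_with_resume(url, path, chunk_size=1024 * 1024):
    """Download `url` to `path`, appending to any existing partial file."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    req = resume_request(url, offset)
    with urllib.request.urlopen(req) as response:
        # 206 Partial Content means the server honoured the Range header;
        # anything else means we must restart from scratch.
        if offset and response.status != 206:
            offset = 0
        with open(path, "ab" if offset else "wb") as fh:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                fh.write(chunk)
```

After the Ctrl-C above, a re-run would send a Range header starting at the size of the existing partial file and append only the remaining bytes.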

CC @katilp

ParthS007 self-assigned this Nov 25, 2020

@ParthS007
Member

@tiborsimko

if the file is partial, it would check whether the user used the --resume option, and if yes, it would continue from that point, and if no, it would ask the user whether resume or redownload is wanted.

I have a couple of musings.

  1. We will know that a file is partially downloaded when its size and checksum do not match those of the remote file.

  2. How do we plan to handle resuming a download?
    Will we go requests -> pycurl -> xrootd and use the built-in functionality of the respective library, or do you have a different approach in mind?
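The size-and-checksum check from point 1 could look roughly like this; a sketch only, assuming the record metadata exposes the remote file size and an Adler-32 checksum (the helper names `local_adler32` and `file_state` are illustrative):

```python
import os
import zlib


def local_adler32(path, chunk_size=1024 * 1024):
    """Stream a local file through zlib.adler32 and return the hex digest."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def file_state(path, remote_size, remote_checksum):
    """Classify a local file as missing, partial, complete, or corrupted."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) != remote_size:
        return "partial"
    if local_adler32(path) == remote_checksum:
        return "complete"
    return "corrupted"  # same size but different content
```

A "partial" result would trigger the resume path, while "corrupted" would require a full redownload.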


@ParthS007 (Member) commented Dec 17, 2020

Resuming of downloads

  1. Requests - downloader: catch more error situations #109
  2. Pycurl - downloader: resume interrupted downloads #117
  3. Xrootd -
