cannot download and parse GEO files #62

Open
Mengflz opened this issue Jul 10, 2020 · 11 comments

Comments

@Mengflz

Mengflz commented Jul 10, 2020

After I downloaded the Series Matrix File(s), the GEOparse.get_GEO function failed and reported that there is no series.
image

So I tried to use GEOparse.get_GEO to download the files from the website instead. This was the result:
image
It seems the URL is wrong.

@guma44
Owner

guma44 commented Jul 13, 2020

@Mengflz
Hi, could you share a complete snippet so I can look at it?

@Mengflz
Author

Mengflz commented Jul 13, 2020

image
image
Here is the error message. I ran into problems when downloading GSE52562, although this dataset can be downloaded from GEO directly.

@guma44
Owner

guma44 commented Jul 13, 2020

Could you also share your versions of Python and GEOparse? It works for me without any problems.

@Mengflz
Author

Mengflz commented Jul 14, 2020

My GEOparse version is 2.0.1, and my Python version is 3.6.8.

@daniwelter

Is there any update on this issue? I have the same problem trying to get the metadata from 190 GEO series. Downloads fail during the checksum stage:
Download failed due to 'Downloaded size do not match the expected size for ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47598/soft/GSE47598_family.soft.gz'. ID could be incorrect or the data might not be public yet.
I can download the files manually, but there are a lot of them, and I couldn't open the two I tested with GEOparse from local files either.
I use GEOparse v2.0.1 and Python 3.8.

@carlosvega

carlosvega commented Feb 12, 2021

As a workaround, you can run export GEOPARSE_USE_HTTP_FOR_FTP=yes before running your code.
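If exporting the variable in the shell is inconvenient (for example in a notebook), the same workaround can probably be applied from inside Python. This is only a sketch: the helper names `enable_http_for_ftp` and `fetch_series` are made up for illustration, and it assumes GEOparse reads the variable from the process environment, which is why it is set before GEOparse is imported.

```python
import os


def enable_http_for_ftp():
    """Set the workaround variable for the current Python process."""
    os.environ["GEOPARSE_USE_HTTP_FOR_FTP"] = "yes"


def fetch_series(geo_id, destdir="./"):
    """Download and parse one GEO record with the workaround enabled.

    Hypothetical helper, not part of GEOparse.
    """
    enable_http_for_ftp()
    # Imported only after the variable is set, in case GEOparse reads it early.
    import GEOparse

    return GEOparse.get_GEO(geo=geo_id, destdir=destdir)
```

Calling `fetch_series("GSE47598")` should then behave like running the script after the `export` line above.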

@daniwelter and I tested the very same code in a virtualenv with the same packages on macOS (Big Sur 11.1 in my case), so I have no clue what's happening.

Packages I used:

certifi==2020.12.5
chardet==4.0.0
GEOparse==2.0.3
idna==2.10
numpy==1.20.1
pandas==1.2.2
python-dateutil==2.8.1
pytz==2021.1
requests==2.25.1
six==1.15.0
tqdm==4.56.2
urllib3==1.26.3

For me it was working with FTP, but the fix above solved the issue for daniwelter.

This makes me think there is an issue in the _download_ftp function, perhaps in how total_size is calculated. The _download_http function takes the size from the headers, but I don't see why the total_size check would behave differently in the two functions; everything looks normal. What if len(data) is bigger than what f.write(data) returns? (write returns the number of characters written.) That could happen, for example, if data contains unusual characters that are not written, such as multiple EOF or EOS characters. It works in my case, so I can't reproduce the issue and this is just a guess. Maybe an underlying network library is doing something odd with the packets.
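One way to test the len(data) versus f.write(data) guess is a small standalone experiment like the one below. It does not use GEOparse's own code; `write_chunks` and `demo` are illustrative names, and the sketch assumes the download target is opened in binary mode ("wb"), which would need to be confirmed against the _download_ftp source. In binary mode Python writes control bytes such as 0x1A or 0x04 like any other byte, so the two counts should agree.

```python
import os
import tempfile


def write_chunks(path, chunks):
    """Write byte chunks to path; return (bytes received, bytes reported written)."""
    received = 0
    written = 0
    with open(path, "wb") as handle:
        for chunk in chunks:
            received += len(chunk)
            written += handle.write(chunk)
    return received, written


def demo():
    """Write chunks containing control bytes and compare all three sizes."""
    chunks = [b"\x00\x1a\x04", b"GSE", b"\xff" * 1000]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        received, written = write_chunks(path, chunks)
        on_disk = os.path.getsize(path)
    return received, written, on_disk
```

If the three numbers returned by `demo()` match, a mismatch between len(data) and the written size is unlikely to explain the failure, at least for files opened in binary mode.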

My MTU seems fine (I know some VPNs interfere with the MTU), and I can't think of any other differences.

networksetup -getMTU en0
Active MTU: 1500 (Current Setting: 1500)

@robertcv

I have the same problem with GSE39582. I don't think it's GEOparse's fault, because I have the same problem when downloading the SOFT file manually: using Firefox or wget, I keep getting files of different sizes. Using HTTP instead of FTP (export GEOPARSE_USE_HTTP_FOR_FTP=yes) solves the issue.
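When manual downloads come back with varying sizes, a quick integrity check can tell a truncated .soft.gz apart from a complete one before it is passed to GEOparse. The sketch below uses only Python's standard gzip module; the function name `is_complete_gzip` is made up for illustration. It only checks that the gzip stream can be read to its end, not that the SOFT content is valid.

```python
import gzip
import zlib


def is_complete_gzip(path, chunk_size=1 << 20):
    """Return True if the gzip file at path decompresses to the end."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data; a truncated or corrupt file raises.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

For example, `is_complete_gzip("GSE39582_family.soft.gz")` returning False would suggest the transfer was cut short rather than that the parser is at fault.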

@guma44
Owner

guma44 commented Feb 18, 2021

Hi, sorry for not replying for a long time. One reason might be that you are behind a corporate proxy. The option GEOPARSE_USE_HTTP_FOR_FTP=yes was introduced because FTP did not work with Travis CI. I will check the functions that @carlosvega mentioned, but these issues are hard to debug because everything works for me.

@carlosvega

Yes, but I was using the same VPN as @daniwelter and it worked for me, so my guess is that it is a network issue… If it doesn't work in the browser either, then it is not your code's fault. But maybe you could add GEOPARSE_USE_HTTP_FOR_FTP as an argparse option or as a failover for FTP.
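Until something like that exists in the library, a failover could be approximated on the caller's side. The wrapper below is a rough sketch and not part of GEOparse: `download_with_http_failover` is a hypothetical name, it catches any exception because the exact error type raised on a size mismatch is not shown in this thread, and it assumes the variable is read when the download starts rather than once at import time.

```python
import os


def download_with_http_failover(download, *args, **kwargs):
    """Call download(); on failure, set the HTTP variable and try once more."""
    try:
        return download(*args, **kwargs)
    except Exception:
        if os.environ.get("GEOPARSE_USE_HTTP_FOR_FTP") == "yes":
            raise  # HTTP was already in use, so there is nothing left to try.
        os.environ["GEOPARSE_USE_HTTP_FOR_FTP"] = "yes"
        return download(*args, **kwargs)
```

A call might look like `download_with_http_failover(GEOparse.get_GEO, geo="GSE47598", destdir="./")`. Note that the variable stays set for the rest of the process after the first failure.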

@daniwelter

@carlosvega @guma44 I wasn't using a VPN at all but the FTP > HTTP switch worked for me either way.

@CholoTook

> Is there any update on this issue? I have the same problem trying to get the metadata from 190 GEO series. Downloads fail during the checksum stage
> Download failed due to 'Downloaded size do not match the expected size for ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47598/soft/GSE47598_family.soft.gz'. ID could be incorrect or the data might not be public yet.
> I can download the files manually but there are a lot of them and the two I tested I couldn't open using GEOparse on local files either.
> I use GEOparse v2.0.1 and python 3.8.

I saw this sometimes, but sometimes it went away... I'm not sure what the problem is; I suspect it's NCBI's side dropping connections.
