max 5000 artworks scraped per artist #30

Open
neenkah opened this issue Nov 13, 2023 · 2 comments

Comments

neenkah commented Nov 13, 2023

Hi @modhurita,

Only the first 5000 artwork URLs get saved to works.txt for artists with many artworks (mostly photographers), such as Gordon Parks, Alfred Eisenstaedt or Carl Mydans.

Scraping all of these would be extremely time-consuming, so it might be nice to provide an adjustable limit.
I also noticed that when scrolling through the artworks index to gather the URLs, the page becomes very slow because all the artworks are loaded. It might be nice to sort by date and scroll through two years at a time (with the URL ending in: &date=1954).
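
To make the date idea concrete, here is a minimal sketch of generating one index URL per two-year slice. The function name `date_sliced_urls` and the example base URL are hypothetical, not part of the scraper; the only detail taken from the site is the `&date=<year>` suffix, and the sketch assumes the base URL already contains a query string.

```python
def date_sliced_urls(base_url, start_year, end_year, step=2):
    """Return one URL per date slice, each ending in '&date=<year>'.

    Hypothetical helper: assumes base_url already has a query string,
    so '&' is the correct separator for the extra parameter.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    # range() excludes its stop value, so add 1 to include end_year itself.
    return [f"{base_url}&date={year}" for year in range(start_year, end_year + 1, step)]


# Example with a placeholder URL (not a real artist page):
for url in date_sliced_urls("https://example.org/entity/artist?categoryId=artist", 1950, 1954):
    print(url)
```

Each slice could then be scrolled separately, which should keep the number of artworks loaded on any one page much smaller.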

Contributor

modhurita commented Nov 23, 2023

Hi @neenkah, thanks for bringing this to our attention.

> Only the first 5000 artwork urls get saved to works.txt

Do you know what exactly happens when you hit the 5000 artworks limit while scraping? Does the right arrow disappear, is it no longer clickable, or something else entirely?

> Scraping all of these would be incredibly time demanding, so it might be nice to provide an adjustable limit.

What exactly do you mean? Do you mean that we should collect (far) fewer than 5000 artworks per artist?

> It might be nice to sort by date and scroll through every two years (with the url ending in: &date=1954).

Collecting the artworks by year is a good suggestion. However, an artist like Alfred Eisenstaedt, with ~200,000 artworks, might well have more than 5000 artworks per year in their most prolific years, so a single year could still hit the limit.
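
One way to spot this would be to count the URLs collected per year and flag any year that reaches the cap, since those results are probably truncated. This is only a sketch under assumptions: `years_at_cap` is a hypothetical helper, the counts are made up for illustration, and the cap of 5000 is the value observed in this issue rather than a documented limit.

```python
def years_at_cap(counts_by_year, cap=5000):
    """Return the years whose collected URL count reached the cap.

    A count equal to (or above) the cap suggests the list for that
    year was cut off and needs a finer split or another strategy.
    """
    return sorted(year for year, count in counts_by_year.items() if count >= cap)


# Illustrative counts only, not real data:
collected = {1954: 5000, 1955: 120, 1960: 4999}
print(years_at_cap(collected))
```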

Author

neenkah commented Nov 27, 2023

Hi @modhurita,

I have never watched the scraper at the moment it reaches 5000 artworks, so I cannot provide any details about that.

About the adjustable limit: I did indeed mean fewer artworks. Someone using the scraper with a limited time budget might prefer coverage of all artists over all artworks of each artist. I think it could be a nice feature to have, but it might not be feasible to check this whilst scraping...
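
For illustration, a minimal sketch of what such a limit could look like if the URLs are produced lazily, one at a time. The name `take_urls` and the parameter `max_artworks` are hypothetical and not part of the scraper; the sketch assumes the URL source is an iterator, so stopping early also avoids any further scrolling work.

```python
from itertools import islice


def take_urls(urls, max_artworks=None):
    """Collect at most max_artworks URLs from an iterable of URLs.

    max_artworks=None means no limit. With a lazy iterator, nothing
    beyond the limit is ever requested from the source.
    """
    if max_artworks is not None and max_artworks < 0:
        raise ValueError("max_artworks must be non-negative or None")
    return list(islice(urls, max_artworks))


# Example with placeholder URLs standing in for scraped links:
fake_source = (f"https://example.org/asset/{i}" for i in range(10))
print(take_urls(fake_source, max_artworks=3))
```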
