Incremental updates #169
Comments
We've been calling this 'incremental' updates internally. CC'ing @boblannon
Some context from opencivicdata/scrapers-us-municipal#25 (comment) about how we could scrape smarter:
The way I've been doing this on my fork was to do my best to make scrapes idempotent. The goal was for a scrape to be all no-ops if an identical scrape had already been done. I did it mostly by narrowing the It works pretty well, even when I re-run a scrape after having merged people or organizations.
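For readers following along, here is a minimal sketch of what "all no-ops on an identical scrape" could look like. It is plain Python and not the fork's code or pupa's importer; `import_item`, the in-memory `store`, and the sample records are all hypothetical stand-ins.

```python
def import_item(store, key, data):
    """Insert or update `data` under `key` and report which action was taken."""
    existing = store.get(key)
    if existing is None:
        store[key] = dict(data)
        return "insert"
    if existing == data:
        # Identical to what we already have: nothing to write.
        return "noop"
    store[key] = dict(data)
    return "update"


# Hypothetical scrape output, keyed by a stable identifier.
scrape = {
    "person-1": {"name": "Jane Doe"},
    "org-1": {"name": "City Council"},
}

store = {}
first_run = [import_item(store, key, data) for key, data in scrape.items()]
second_run = [import_item(store, key, data) for key, data in scrape.items()]
```

In this sketch the first run inserts everything and the identical second run touches nothing, which is the property being described.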
FWIW, stock Pupa scrapes are more idempotent than incremental scrapers, since we don't have to rely on carrying state without full rescrapes to ensure state
I don't see how that importer code will reduce the number of pages the scraper will visit. Sorry if I'm being dense.
@fgregg Because the importer needs to be able to take negative actions too, so we do need to know what the database should look like. This is a simple one that's easily solved, but take for example documents attached to a bill: if we only scrape some of the related documents, we can't tell when to remove a document, since it might be missing because we didn't scrape it, or because it's gone. We can scale that back to full collections, too. That's just an example, not the actual issue.
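To make the ambiguity concrete, here is a minimal sketch in plain Python. It is not pupa's importer; `documents_to_delete` and the sample document names are hypothetical. It only illustrates that a deletion can be inferred when the scrape covered the whole collection, and not otherwise.

```python
def documents_to_delete(in_db, scraped, scrape_was_complete):
    """Return the documents that can safely be removed from the database."""
    missing = set(in_db) - set(scraped)
    if not scrape_was_complete:
        # A missing document might be gone upstream, or we simply did not
        # scrape it this time. We cannot tell, so no removal is safe.
        return set()
    return missing


# Hypothetical documents attached to one bill.
in_db = {"bill-1/text.pdf", "bill-1/fiscal-note.pdf"}
scraped = {"bill-1/text.pdf"}

after_full_scrape = documents_to_delete(in_db, scraped, scrape_was_complete=True)
after_partial_scrape = documents_to_delete(in_db, scraped, scrape_was_complete=False)
```

With a full scrape the fiscal note is known to be gone; with a partial scrape the same input gives no basis for deleting anything.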
I've wanted to do this for years, and I have states I could do this on, so the reason it's not implemented isn't a lack of wanting the feature, is all I'm saying
I think there's two issues here.
This code seems to be about 2, but I'm talking about 1. These issues are obviously connected but not identical. Having the importer know about the DB makes a lot of sense; it already has to. But for 1, it seems like the scraper also has to know some facts from the DB, and this is what I haven't seen before.
It's the same issue internally, 1 and 2. Passing something like the last scraped time is trivial; that's not the technical issue behind the end behavior you're after. The real issue is 2, not 1.
Oh, okay, so there's no problem with having the scraper access the DB?
Also, FWIW, the scraper does not currently know about the DB in any way, it just writes to JSON to disk. Decoupling like that has been really great for us in the past.
If pupa knows it's doing an import, it can talk to the DB. The scraper must never talk to the DB under any conditions.
so how can I pass the last-scraped-time to the scraper?
Pupa, if it sees it's doing an import, brings up the Django connection before scrape, and can handle that. The scraper shouldn't.
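One way to picture that split, as a sketch only: the runner layer is the only code allowed to ask the database anything, and it hands the scraper a plain value. Nothing here is pupa's actual API; `BillScraper`, `run_update`, the `since` argument, and the `lookup_last_scraped` callable are hypothetical names, and the lambda stands in for whatever query the runner would make.

```python
from datetime import datetime, timezone


class BillScraper:
    """Hypothetical scraper: it never opens a database connection."""

    def scrape(self, since=None):
        # Stand-in for pages fetched from the source site.
        pages = [
            {"id": "bill-1", "updated": datetime(2015, 1, 5, tzinfo=timezone.utc)},
            {"id": "bill-2", "updated": datetime(2015, 3, 9, tzinfo=timezone.utc)},
        ]
        for page in pages:
            # With no `since`, fall back to a full scrape.
            if since is None or page["updated"] > since:
                yield page


def run_update(scraper, lookup_last_scraped, importing):
    """Hypothetical runner: only this layer may consult the database."""
    since = lookup_last_scraped() if importing else None
    return list(scraper.scrape(since=since))


last_run = datetime(2015, 2, 1, tzinfo=timezone.utc)
incremental = run_update(BillScraper(), lambda: last_run, importing=True)
full = run_update(BillScraper(), lambda: last_run, importing=False)
```

The scraper stays decoupled: it receives a timestamp, not a connection, and still works as a full scrape when no import is happening.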
okay, so where do I write the query that I want pupa to pass to the scraper?
We've moved to IRC; link for posterity: https://github.com/opencivicdata/pupa/blob/master/pupa/cli/commands/update.py#L223-L226
K. Should this be split into an incremental DB update issue and an issue for having pupa give info like last_scraped to the scraper?
Chicago has a lot of legislation and long legislative sessions (4 years). This means that it currently takes about 48 hours to scrape the site every week.
There are some strategies for smarter scraping, but they all require that the scraper know, to some degree, what's already in the database.
Right now, the scraper doesn't know anything about the database, and that seems like it has been a good and sane thing. How should we proceed?
One idea is for the scraper to maybe hit the OCD API? Is that a looser coupling? Thoughts please!