error running roy harvest -wikidata #183
seeing/learning that error code 429 means too many requests are hitting the server from the client - is there a way to rate limit the requests from roy? or another way to get a Wikidata signature file to start with?
that's exactly it @EG-tech. Tyler had the same a while back (via email request, so not on GitHub). We might need to put it into the FAQ. Some notes from what I wrote to Tyler:
Long-term, something approaching rate limiting may work. Right now it's just a single request asking for a lot of data. In the short-to-medium term, this pull request should mean you can grab an identifier from Richard's itforarchivists server and it will let you get up and running: #178 (PR just needs review (and fixes) and merging). EDIT: NB. For Tyler, he just tried it later in the day or next morning and it worked.
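For illustration, the shape of client-side rate limiting in Go might look like the sketch below - a minimal example using golang.org/x/time/rate, not what roy or #178 actually implements; the endpoint is a made-up placeholder:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Allow at most one request per second with no bursting, so a batch
	// of harvest calls can't trip the server's 429 threshold.
	limiter := rate.NewLimiter(rate.Every(time.Second), 1)

	// Placeholder URL; roy builds its real harvest requests elsewhere.
	endpoint := "https://example.com/api"

	for i := 0; i < 3; i++ {
		// Wait blocks until the limiter permits the next request.
		if err := limiter.Wait(context.Background()); err != nil {
			fmt.Println("limiter error:", err)
			return
		}
		resp, err := http.Get(endpoint)
		if err != nil {
			fmt.Println("request error:", err)
			return
		}
		fmt.Println("status:", resp.StatusCode)
		resp.Body.Close()
	}
}
```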
thanks @ross-spencer!! that all makes sense - thanks for confirming, and I'll play with your suggestion when I get the chance. amazing work!

ah, thanks @EG-tech 🙂
NB. Just to report, we are still seeing this issue in places. I haven't been able to determine when a harvest call is likely to be successful, other than that it seems to work better in Europe than on the US West Coast.
cc. @thorsted Someone reached out at the last talk I gave about the Wikidata integration - specifically about long-running queries. I discovered this was because they run a mirror without timeouts, for a cost per query. Their service and another example are linked below. (I don't think this is the way to go, but it's useful to know about.)
@anjackson just updated our SPARQL query on the digipres format explorer - the magic is in the FILTER expression, which cuts results from 70,000 to 17,000 (approx.). Worth a try to see if it improves performance?

```sparql
SELECT DISTINCT ?uri ?uriLabel ?puid ?extension ?mimetype ?encodingLabel ?referenceLabel ?date ?relativityLabel ?offset ?sig
WHERE
{
  # Return records of type File Format or File Format Family (via instance or subclass chain):
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.
  # Only return records that have at least one useful format identifier:
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163 [] }.
  OPTIONAL { ?uri wdt:P2748 ?puid. }      # PUID is used to map to PRONOM signatures
  OPTIONAL { ?uri wdt:P1195 ?extension. } # File extension
  OPTIONAL { ?uri wdt:P1163 ?mimetype. }  # IANA Media Type
  OPTIONAL { ?uri p:P4152 ?object.        # Format identification pattern statement
    OPTIONAL { ?object pq:P3294 ?encoding. }   # We don't always have an encoding
    OPTIONAL { ?object ps:P4152 ?sig. }        # We always have a signature
    OPTIONAL { ?object pq:P2210 ?relativity. } # Relativity to beginning or end of file
    OPTIONAL { ?object pq:P4153 ?offset. }     # Offset relative to the relativity
    OPTIONAL { ?object prov:wasDerivedFrom ?provenance.
      OPTIONAL { ?provenance pr:P248 ?reference;
                             pr:P813 ?date.
      }
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
ORDER BY ?uri
```

@thorsted have you tried the custom SPARQL technique? https://github.com/richardlehane/siegfried/wiki/Wikidata-identifier#using-the-custom-wikibase-functionality-for-wikidata <-- any chance you could try the SPARQL above to see if it returns more reliably? (I can create a test binary too.)
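If anyone wants to test the query's reliability outside of roy first, a minimal Go sketch along these lines should work against the public WDQS endpoint - the stand-in query should be swapped for the full one above, and the User-Agent string is a made-up placeholder:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
	"time"
)

func main() {
	// Trivial stand-in query; paste the full query above in its place.
	query := "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"

	form := url.Values{}
	form.Set("query", query)

	req, err := http.NewRequest(
		"POST",
		"https://query.wikidata.org/sparql",
		strings.NewReader(form.Encode()),
	)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req.Header.Set("Accept", "application/sparql-results+json")
	// WDQS etiquette: identify your client with a descriptive User-Agent.
	req.Header.Set("User-Agent", "sparql-reliability-test/0.1 (example)")

	start := time.Now()
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)

	fmt.Printf("status: %d, elapsed: %s, bytes: %d\n",
		resp.StatusCode, time.Since(start), len(body))
}
```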
NB. Although, this query needs a PUID, or MIME type, or extension, and there might be Wikidata records without these, so maybe we need to add sig in too, e.g.
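Something like the following, perhaps - an untested sketch that assumes wdt:P4152 is available as the truthy form of the format identification pattern property used above:

```sparql
SELECT DISTINCT ?uri ?sig
WHERE
{
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.
  # Amended filter: also keep records that only have a signature,
  # even if they lack a PUID, MIME type, and extension:
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163|wdt:P4152 [] }.
  OPTIONAL { ?uri p:P4152 ?object.
    OPTIONAL { ?object ps:P4152 ?sig. }
  }
}
LIMIT 10
```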
Thanks @ross-spencer - FWIW I'm in the midst of writing things up and I'm not at all sure I'm quite there yet.
@anjackson there was some explanation of these patterns in ffdev-info/wikidp-issues#24 (comment) via @BertrandCaron that may be helpful? Re: the PSD issue - is this why you included the UNION with file format family? Did it work?
@ross-spencer Yes, adding that UNION brought in PSD, which is declared as an instance of file format family.
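For reference, a minimal sketch of what that UNION looks like - assuming wd:Q26085352 is the "file format family" item:

```sparql
SELECT DISTINCT ?uri
WHERE
{
  # Match items that are instances (via subclass chain) of either
  # file format or file format family:
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }   # file format
  UNION
  { ?uri wdt:P31/wdt:P279* wd:Q26085352 } # file format family (QID assumed)
}
LIMIT 10
```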
FWIW, here's what I've written up so far: https://anjackson.net/2024/07/12/finding-formats-in-wikidata/
In NYC this week, I am continuing to see this issue with the patches applied to roy:

siglen 8:
siglen 10:
siglen 12:

Not great. I will see if I can generate more logging.
I think I have it. I have a utility for running Wikidata queries like shell scripts using golang here: https://github.com/ross-spencer/spargo - I found out the standard query is very quick.
It looks like the hang-up in roy is retrieving user provenance.
Via this Stack Overflow post there's some basic web-scraping good behaviour I could be implementing: https://stackoverflow.com/a/62396996

And if I add more logging to wikiprov, which is doing this work, I see that the responses do have Retry-After set. I think there are more than a few things we can do here, as well as a simple "no provenance" option. If we add a sleep and retry based on the Retry-After value we can make this pass pretty quickly:

```diff
diff --git a/pkg/wikiprov/wikiprov.go b/pkg/wikiprov/wikiprov.go
index 85d6ab8..9c27f26 100644
--- a/pkg/wikiprov/wikiprov.go
+++ b/pkg/wikiprov/wikiprov.go
@@ -24,6 +24,10 @@ import (
 	"io/ioutil"
 	"net/http"
 	"strings"
+
+	"log"
+	"strconv"
+	"time"
 )
 
 func getRevisionProperties() string {
@@ -91,6 +95,16 @@ func GetWikidataProvenance(id string, history int) (Provenance, error) {
 		return Provenance{}, err
 	}
 
+	if len(resp.Header["Retry-After"]) > 0 {
+		retry, _ := strconv.Atoi(resp.Header["Retry-After"][0])
+		if retry > 0 {
+			log.Println("------- retrying ---------")
+			log.Println(retry)
+			time.Sleep(time.Duration(retry) * time.Second) // Retry-After is in seconds
+			return GetWikidataProvenance(id, history)
+		}
+	}
+
 	const expectedCode int = 200
 	if resp.StatusCode != expectedCode {
 		responseError := ResponseError{}
```

If we're a bit nicer to Wikidata's Wikibase API it's really nice to us, so we should be able to do this pretty easily. I'll probably add a sleep-and-retry based on the Retry-After value, and I think this will fix things. Apologies for the time wasted on this - I wasn't even looking in the Wikibase API's direction and didn't think for a second it would be interfering. I'm not sure why I was unable to recreate the issue in my regional virtual machines, but maybe the internet connection does still have some impact, e.g. a slower connection makes fewer attempts and so doesn't trigger the issue? Will be good to fix this. Hopefully have something up soon!
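For what it's worth, a cleaner standalone sketch of the same sleep-and-retry idea, with a cap on retries so a stubborn server can't loop forever - this is an illustration, not the final wikiprov patch, and the endpoint in main is a placeholder:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// getWithRetry performs a GET and, when the server replies with a
// Retry-After header (seconds form), sleeps and retries up to maxRetries
// times before giving up and returning the last response.
func getWithRetry(url string, maxRetries int) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		retryAfter := resp.Header.Get("Retry-After")
		if retryAfter == "" || attempt >= maxRetries {
			return resp, nil
		}
		resp.Body.Close()
		seconds, err := strconv.Atoi(retryAfter)
		if err != nil || seconds <= 0 {
			// Retry-After can also be an HTTP-date; fall back to a short wait.
			seconds = 1
		}
		fmt.Printf("retry %d: waiting %d seconds\n", attempt+1, seconds)
		time.Sleep(time.Duration(seconds) * time.Second)
	}
}

func main() {
	// Hypothetical endpoint for illustration only.
	resp, err := getWithRetry("https://example.com/api", 3)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode)
}
```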
I'm trying out the instructions here and am getting the following error/output when trying to run

```
$ roy harvest -wikidata
```

to start off. I'm on Ubuntu 20.04 with the latest siegfried release (1.9.2) - is there something obvious I'm doing wrong? (@ross-spencer?)