
error running roy harvest -wikidata #183

Open · EG-tech opened this issue Apr 27, 2022 · 15 comments

EG-tech commented Apr 27, 2022

I'm trying out the instructions here and, to start off, am getting the following error output when running $ roy harvest -wikidata:

2022/04/27 09:23:21 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2022/04/27 09:23:21 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2022/04/27 09:23:21 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2022/04/27 09:24:55 Error trying to retrieve SPARQL with revision history: warning: there were errors retrieving provenance from Wikibase API: wikiprov: unexpected response from server: 429

I'm on Ubuntu 20.04 with the latest siegfried release (1.9.2). Is there something obvious I'm doing wrong? (@ross-spencer?)

EG-tech commented Apr 27, 2022

Seeing/learning that error code 429 has to do with too many requests hitting the server from the client: is there a way to rate limit the requests from roy? Or another way to get a Wikidata signature file to start with?

ross-spencer commented Apr 27, 2022

That's exactly it @EG-tech. Tyler had the same issue a while back (via an email request, so not on GitHub). We might need to put it into the FAQ.

Some notes on what I wrote to Tyler:

The Wikidata documentation on the query service (WDQS) is below, but it's not very clear, i.e. it talks about processing time, not how that translates to large queries like ours:

https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits

I have known this was a risk, though I can't quantify how many queries it takes or when it will happen. There have been times during testing, for example, when I have run the query upwards of 30 times in a day.

We set a custom User-Agent header on the request, which should be recognized by WDQS and mitigate this issue somewhat; it is more friendly to known user-agents than to unknown ones.
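
For illustration only, setting that kind of header in Go looks roughly like the sketch below. This is not the exact code roy runs; the User-Agent value is modelled on the wikiprov string that appears in the logs further down this thread, with a placeholder contact address:

package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Sketch only: a SPARQL GET with a descriptive User-Agent, which WDQS
	// treats more kindly than an anonymous default client.
	query := url.Values{
		"query":  {"SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"},
		"format": {"json"},
	}
	req, err := http.NewRequest("GET", "https://query.wikidata.org/sparql?"+query.Encode(), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("User-Agent", "wikiprov/0.2.0 (https://github.com/ross-spencer/wikiprov/; contact@example.com)")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}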

Long term, something approaching rate limiting may work. Right now it's just a single request asking for a lot of data.
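
Purely as a sketch of what client-side rate limiting of the per-item requests could look like (roy doesn't do this today; fetchProvenance is a hypothetical stand-in for the real call):

package main

import (
	"context"
	"log"

	"golang.org/x/time/rate"
)

// fetchProvenance is a hypothetical stand-in for the per-item Wikibase API call.
func fetchProvenance(qid string) error {
	log.Println("fetching provenance for", qid)
	return nil
}

func main() {
	// Allow roughly two requests per second, with a burst of two.
	limiter := rate.NewLimiter(rate.Limit(2), 2)
	for _, qid := range []string{"Q235557", "Q26085352", "Q35221946"} {
		if err := limiter.Wait(context.Background()); err != nil {
			log.Fatal(err)
		}
		if err := fetchProvenance(qid); err != nil {
			log.Println("error:", err)
		}
	}
}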

In the short-to-medium term, this pull request should mean you can grab an identifier from Richard's itforarchivists server and it will let you get up and running: #178 (the PR just needs review, fixes, and merging).

EDIT: NB. In Tyler's case, he simply tried again later in the day (or the next morning) and it worked.

EG-tech commented Apr 27, 2022

Thanks @ross-spencer!! That all makes sense, thanks for confirming, and I'll play with your suggestion when I get the chance. Amazing work!

@ross-spencer

ah, thanks @EG-tech 🙂

@ross-spencer

The -update flag now supports Wikidata, which should provide a workaround for most people facing this issue; there's an underlying reliability issue that might still be solved here as per the above.

@ross-spencer

NB. Just to report, we are still seeing this issue in places. I haven't been able to determine when a harvest call is likely to be successful, other than that it seems to work better in Europe than on the US West Coast.

@ross-spencer

cc. @thorsted

Someone reached out at the last talk I gave about the Wikidata integration, specifically about long-running queries. I discovered this was because they run a mirror without timeouts, for a cost per query. Their service and another example are linked below:

(I don't think this is the way to go but it's useful to know about)

@ross-spencer

@anjackson just updated our SPARQL query on the digipres format explorer. The magic is in the FILTER expression, which cuts results from approximately 70,000 to 17,000. Worth a try to see if it improves performance?

SELECT DISTINCT ?uri ?uriLabel ?puid ?extension ?mimetype ?encodingLabel ?referenceLabel ?date ?relativityLabel ?offset ?sig
WHERE
{
  # Return records of type File Format or File Format Family (via instance or subclass chain):
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.
      
  # Only return records that have at least one useful format identifier
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163 [] }.       
  
  OPTIONAL { ?uri wdt:P2748 ?puid.      }          # PUID is used to map to PRONOM signatures
  OPTIONAL { ?uri wdt:P1195 ?extension. }          # File extension
  OPTIONAL { ?uri wdt:P1163 ?mimetype.  }          # IANA Media Type
  OPTIONAL { ?uri p:P4152 ?object.                 # Format identification pattern statement
    OPTIONAL { ?object pq:P3294 ?encoding.   }     # We don't always have an encoding
    OPTIONAL { ?object ps:P4152 ?sig.        }     # We always have a signature
    OPTIONAL { ?object pq:P2210 ?relativity. }     # Relativity to beginning or end of file
    OPTIONAL { ?object pq:P4153 ?offset.     }     # Offset relative to the relativity
    OPTIONAL { ?object prov:wasDerivedFrom ?provenance.
       OPTIONAL { ?provenance pr:P248 ?reference;
                              pr:P813 ?date.
                }
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
ORDER BY ?uri

@thorsted have you tried the custom SPARQL technique? https://github.com/richardlehane/siegfried/wiki/Wikidata-identifier#using-the-custom-wikibase-functionality-for-wikidata <-- any chance you could try the SPARQL above to see if it returns more reliably?

(I can create a test binary too)

via digipres/digipres.github.io#48 (comment)

ross-spencer commented Jul 11, 2024

NB. although, this query requires a PUID, MIMEType, or extension, and there might be Wikidata records without any of these, so maybe we need to add the signature property in too, e.g. FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163|ps:P4152 [] }.

@anjackson

Thanks @ross-spencer - FWIW I'm in the midst of writing things up and I'm not at all sure I'm quite there yet. The wdt:P31/wdt:P279* wd:Q235557 pattern seems to be missing some records (e.g. no *.psd!), and I'm seeing different variations in different places (wdt:P31*/wdt:P279*, p:P31/ps:P31/wdt:P279*) which I can't say I fully understand at this point.

But the FILTER thing seems to help with the overall size/performance.

@ross-spencer

@anjackson there was some explanation of these patterns in ffdev-info/wikidp-issues#24 (comment) via @BertrandCaron that may be helpful?

Re: the PSD issue, is this why you included the UNION with File Format Family? Did it work?

@anjackson

@ross-spencer Yes, adding that UNION brought in PSD, which is declared as an instance of File Format Family but not of File Format (other families appear to be explicitly declared as instances of both). But so did using P31* instead of UNION, as a File Format Family is an instance of a File Format. At the time of writing, UNION matches 69,961 (unFILTERed) records and P31* matches 70,363, so something else is going on too. This is what I'm attempting to write up.

@anjackson

FWIW, here's what I've written up so far: https://anjackson.net/2024/07/12/finding-formats-in-wikidata/

ross-spencer commented Jan 1, 2025

In NYC this week, I am continuing to see this issue with the patches applied to roy:

$ time ./roy harvest -wikidata
2025/01/01 15:46:51 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2025/01/01 15:46:51 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2025/01/01 15:46:51 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2025/01/01 15:47:43 Error trying to retrieve SPARQL with revision history: warning: there were errors retrieving provenance from Wikibase API: wikiprov: unexpected response from server: 429

real    0m51.139s
user    0m12.959s
sys     0m7.422s

siglen 8:

$ time ./roy harvest -wikidata -siglen 8
2025/01/01 15:48:40 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2025/01/01 15:48:40 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2025/01/01 15:48:40 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2025/01/01 15:49:48 Error trying to retrieve SPARQL with revision history: warning: there were errors retrieving provenance from Wikibase API: wikiprov: unexpected response from server: 429

real    1m8.825s
user    0m11.941s
sys     0m7.604s

siglen 10:

$ time ./roy harvest -wikidata -siglen 10
2025/01/01 15:52:34 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2025/01/01 15:52:34 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2025/01/01 15:52:34 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2025/01/01 15:53:17 Error trying to retrieve SPARQL with revision history: warning: there were errors retrieving provenance from Wikibase API: wikiprov: unexpected response from server: 429

real    0m42.618s
user    0m9.924s
sys     0m6.005s

siglen 12:

$ time ./roy harvest -wikidata -siglen 12
2025/01/01 15:55:36 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2025/01/01 15:55:36 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2025/01/01 15:55:36 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2025/01/01 15:56:20 Error trying to retrieve SPARQL with revision history: warning: there were errors retrieving provenance from Wikibase API: wikiprov: unexpected response from server: 429

real    0m44.860s
user    0m9.791s
sys     0m5.597s

Not great. I will see if I can generate more logging.

@ross-spencer

I think I have it.

I have a utility for running Wikidata queries like shell scripts using Go here: https://github.com/ross-spencer/spargo

It turns out the standard query is very quick.

$ time ./wd.sparql | grep entity | sort | uniq | wc -l
Connecting to: https://query.wikidata.org/sparql

Query: # Return all file format records from Wikidata.
SELECT DISTINCT ?uri ?uriLabel ?puid ?extension ?mimetype ?encoding ?referenceLabel ?date ?relativity ?offset ?sig WHERE {
  { ?uri (wdt:P31/(wdt:P279*)) wd:Q235557. }
  UNION
  { ?uri (wdt:P31/(wdt:P279*)) wd:Q26085352. }
  FILTER(EXISTS { ?uri (wdt:P2748|wdt:P1195|wdt:P1163|ps:P4152) _:b2. })
  FILTER((STRLEN(?sig)) >= 6 )
  OPTIONAL { ?uri wdt:P2748 ?puid. }
  OPTIONAL { ?uri wdt:P1195 ?extension. }
  OPTIONAL { ?uri wdt:P1163 ?mimetype. }
  OPTIONAL {
    ?uri p:P4152 ?object.
    OPTIONAL { ?object pq:P3294 ?encoding. }
    OPTIONAL { ?object ps:P4152 ?sig. }
    OPTIONAL { ?object pq:P2210 ?relativity. }
    OPTIONAL { ?object pq:P4153 ?offset. }
    OPTIONAL {
      ?object prov:wasDerivedFrom ?provenance.
      OPTIONAL {
        ?provenance pr:P248 ?reference;
          pr:P813 ?date.
      }
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
ORDER BY (?uri)


8482

real    0m1.535s
user    0m1.222s
sys     0m0.455s

It looks like the hang-up in roy is retrieving user provenance, e.g.

https://www.wikidata.org/w/api.php?action=query&format=json&prop=revisions&rvlimit=5&rvprop=ids%7Cuser%7Ccomment%7Ctimestamp%7Csha1&titles=item%3AQ35221946

e.g.

───────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: inspect.example
───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ Q35221946
   2   │ Name: 'RAR, version 5'
   3   │ MIMEType: 'application/x-rar-compressed; application/vnd.rar'
   4   │ Sources: 'Gary Kessler's File Signature Table (source date: 2017-08-07) PRONOM (Official (fmt/613))' 
   5   │ Revision History: {
   6   │   "Title": "Q35221946",
   7   │   "Revision": 2209073726,
   8   │   "Modified": "2024-07-20T23:13:41Z",
   9   │   "Permalink": "https://www.wikidata.org/w/index.php?oldid=2209073726&title=Q35221946",
  10   │   "History": [
  11   │     "2024-07-20T23:13:41Z (oldid: 2209073726): 'Vitplister' edited: '/* wbremoveclaims-remove:1| */ [[Property:P4152]]: 526172211A070100'",
  12   │     "2024-07-20T23:13:34Z (oldid: 2209073671): 'Vitplister' edited: '/* wbsetclaim-update:2||1|1 */ [[Property:P4152]]: 526172211A070100'",
  13   │     "2024-07-20T23:13:07Z (oldid: 2209073493): 'Vitplister' edited: '/* wbsetclaim-update:2||1|1 */ [[Property:P4152]]: 526172211A070100'",
  14   │     "2023-04-02T15:05:39Z (oldid: 1866796586): 'Renamerr' edited: '/* wbsetdescription-add:1|uk */ формат файлу, [[:toollabs:quickstatements/#/batch/151018|batch #151018]]'",
  15   │     "2021-03-10T14:16:29Z (oldid: 1379273425): 'YULdigitalpreservation' edited: '/* wbeditentity-update:0| */ WikiDP Trid Signatures Single([[:toollabs:editgroups/b/CB/a6d7c6
       │ d0301|details]])'"
  16   │   ]
  17   │ }
  18   │ ---
  19   │ Signatures:
  20   │ globs: *.rar
  21   │ sigs: (B:0 seq "Rar!\x1a\a\x01\x00")
  22   │       (B:0 seq "Rar!\x1a\a\x01\x00")
  23   │ superiors: none
───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

NB. The history array on line 10 is generated by the native Wikibase API, not the Wikidata Query Service.
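
For reference, a minimal Go sketch of that revisions call against the Wikibase API, using the same parameters as the api.php URL above (not the exact code in wikiprov):

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Same parameters as the api.php URL above: the five most recent
	// revisions for the item, with ids, user, comment, timestamp and sha1.
	params := url.Values{
		"action":  {"query"},
		"format":  {"json"},
		"prop":    {"revisions"},
		"rvlimit": {"5"},
		"rvprop":  {"ids|user|comment|timestamp|sha1"},
		"titles":  {"item:Q35221946"},
	}
	resp, err := http.Get("https://www.wikidata.org/w/api.php?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body), "bytes of revision JSON")
}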

Via this Stack Overflow post, there's some basic web-scraping good behaviour I could be implementing: https://stackoverflow.com/a/62396996

And if I add more logging to wikiprov, which is doing this work, I see that the responses do have Retry-After set after a number of attempts:

2025/01/01 17:11:41 request &{GET https://www.wikidata.org/w/api.php?action=query&format=json&prop=revisions&rvlimit=5&rvprop=ids%7Cuser%7Ccomment%7Ctimestamp%7Csha1&titles=item%3AQ105851857 HTTP/1.1 1 1 map[User-Agent:[wikiprov/0.2.0 (https://github.com/ross-spencer/wikiprov/; all.along.the.watchtower+github@gmail.com)]] <nil> <nil> 0 [] false www.wikidata.org map[] map[] <nil> map[]   <nil> <nil> <nil> {{}} <nil> [] map[]}
2025/01/01 17:11:41 header: map[Content-Length:[1854] Content-Type:[text/html; charset=utf-8] Date:[Wed, 01 Jan 2025 22:11:42 GMT] Nel:[{ "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}] Report-To:[{ "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }] Retry-After:[1] Server:[Varnish] Server-Timing:[cache;desc="int-front", host;desc="cp1114"] Set-Cookie:[WMF-Last-Access=01-Jan-2025;Path=/;HttpOnly;secure;Expires=Sun, 02 Feb 2025 12:00:00 GMT WMF-Last-Access-Global=01-Jan-2025;Path=/;Domain=.wikidata.org;HttpOnly;secure;Expires=Sun, 02 Feb 2025 12:00:00 GMT] Strict-Transport-Security:[max-age=106384710; includeSubDomains; preload] X-Cache:[cp1114 int] X-Cache-Status:[int-front] X-Client-Ip:[64.92.67.130]]
2025/01/01 17:11:41 header: [1]

Here it is set to 1, which is a small value given the number of requests we're making, and we're also not honouring it.

I think there are more than a few things we can do here, as well as a simple "no provenance" option.

If we add a sleep and retry based on the retry-after value we can make this pass pretty quickly:

2025/01/01 17:31:59 ------- retrying ---------
2025/01/01 17:31:59 1
2025/01/01 17:31:59 ------- retrying ---------
2025/01/01 17:31:59 1
2025/01/01 17:32:00 Roy (Wikidata): Harvesting Wikidata definitions '/home/ross/.local/share/siegfried/wikidata/wikidata-definitions-3.0.0' complete

real    1m46.684s
user    0m23.893s
sys     0m17.254s
diff --git a/pkg/wikiprov/wikiprov.go b/pkg/wikiprov/wikiprov.go
index 85d6ab8..9c27f26 100644
--- a/pkg/wikiprov/wikiprov.go
+++ b/pkg/wikiprov/wikiprov.go
@@ -24,6 +24,10 @@ import (
        "io/ioutil"
        "net/http"
        "strings"
+
+       "log"
+       "strconv"
+       "time"
 )
 
 func getRevisionProperties() string {
@@ -91,6 +95,16 @@ func GetWikidataProvenance(id string, history int) (Provenance, error) {
                return Provenance{}, err
        }
 
+       if len(resp.Header["Retry-After"]) > 0 {
+               retry, _ := strconv.Atoi(resp.Header["Retry-After"][0])
+               if retry > 0 {
+                       log.Println("------- retrying ---------")
+                       log.Println(retry)
+                       time.Sleep(time.Duration(retry) * time.Second) // Retry-After is seconds; bare time.Duration(retry) would be nanoseconds
+                       return GetWikidataProvenance(id, history)
+               }
+       }
+
        const expectedCode int = 200
        if resp.StatusCode != expectedCode {
                responseError := ResponseError{}

If we're a bit nicer to Wikidata's Wikibase API, it's really nice to us in return, so we should be able to do this pretty easily.

I'll probably:

  • implement a "no provenance" option, as this is an as-yet underused function.
  • implement a retry-after approach like the one above (see the sketch after this list).
  • add some better return information to delineate whether an error is with Wikidata or the Wikibase API.
  • have a look at anything else that might help along the way. Suggestions welcome.
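
A slightly tidier take on the retry-after idea than the quick diff above might look like the following sketch; doProvenanceRequest is a hypothetical stand-in for the real wikiprov request, Retry-After is read as whole seconds, and retries are capped so it can't loop forever:

package main

import (
	"errors"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// doProvenanceRequest is a hypothetical stand-in for the real Wikibase API call in wikiprov.
func doProvenanceRequest(url string) (*http.Response, error) {
	return http.Get(url)
}

// getWithRetry honours Retry-After (in seconds), giving up after maxRetries extra attempts.
func getWithRetry(url string, maxRetries int) (*http.Response, error) {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		resp, err := doProvenanceRequest(url)
		if err != nil {
			return nil, err
		}
		retryAfter := resp.Header.Get("Retry-After")
		if resp.StatusCode != http.StatusTooManyRequests && retryAfter == "" {
			return resp, nil
		}
		resp.Body.Close()
		seconds, err := strconv.Atoi(retryAfter)
		if err != nil || seconds < 1 {
			seconds = 1
		}
		fmt.Printf("got %d, retrying after %d second(s)\n", resp.StatusCode, seconds)
		time.Sleep(time.Duration(seconds) * time.Second)
	}
	return nil, errors.New("giving up after too many Retry-After responses")
}

func main() {
	resp, err := getWithRetry("https://www.wikidata.org/w/api.php?action=query&format=json&prop=revisions&rvlimit=5&rvprop=ids%7Cuser%7Ccomment%7Ctimestamp%7Csha1&titles=item%3AQ35221946", 3)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}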

And I think this will fix things.

Apologies for the time wasted on this; I wasn't even looking in the Wikibase API's direction and didn't think for a second it would be interfering. I'm not sure why I was unable to recreate the issue in my regional virtual machines, but maybe the internet connection does still have some impact, e.g. a slower connection makes fewer attempts per second and so doesn't trigger the issue?

Will be good to fix this. Hopefully have something up soon!
