Automate SPARQL query generation to Wikidata by items with P #40
Comments
Okay, this one is a generic query:
SELECT DISTINCT ?item ?itemLabel WHERE {
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
{
SELECT DISTINCT ?item WHERE {
{
?item p:P1585 ?statement0.
?statement0 (ps:P1585) _:anyValueP1585.
#FILTER(EXISTS { ?statement0 prov:wasDerivedFrom ?reference. })
}
}
}
}
However, we already query the human languages (but we can work around it). Maybe this feature will be somewhat hardcoded, because implementing it at full potential would also mean reading 1603_1_7 and "understanding" what each P means.
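As a rough illustration of the "hardcoded" variant, the inner subquery above can be produced from nothing but the property id. This is a minimal sketch, not the actual option in 1603_3_12.py; the function name `build_items_query` is hypothetical.

```python
import re


def build_items_query(prop: str) -> str:
    """Return a SPARQL query listing every item that has a statement for `prop`.

    Hypothetical helper: it mirrors the inner subquery of the generic query,
    without the label service.
    """
    # Accept only well-formed property ids such as "P1585".
    if not re.fullmatch(r"P[1-9][0-9]*", prop):
        raise ValueError(f"not a Wikidata property id: {prop!r}")
    return "\n".join([
        "SELECT DISTINCT ?item WHERE {",
        f"  ?item p:{prop} ?statement0 .",
        f"  ?statement0 ps:{prop} _:anyValue{prop} .",
        "}",
    ])


if __name__ == "__main__":
    print(build_items_query("P1585"))
```

Validating the id before interpolating it keeps arbitrary text out of the generated query.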
Great. We managed to use a pre-processor to create the queries (--lingua-divisioni=18 and --lingua-paginae=1 are paginators). I think the Brazilian cities may be one of those codices with over 200 languages.
Current working example
SELECT (STRAFTER(STR(?item), "entity/") AS ?item__conceptum__codicem) ?item__rem__i_ara__is_arab ?item__rem__i_hye__is_armn ?item__rem__i_ben__is_beng ?item__rem__i_rus__is_cyrl ?item__rem__i_hin__is_deva ?item__rem__i_amh__is_ethi ?item__rem__i_kat__is_geor ?item__rem__i_grc__is_grek ?item__rem__i_guj__is_gujr ?item__rem__i_pan__is_guru ?item__rem__i_kan__is_knda ?item__rem__i_kor__is_hang ?item__rem__i_lzh__is_hant ?item__rem__i_heb__is_hebr ?item__rem__i_khm__is_khmr WHERE {
{
SELECT DISTINCT ?item WHERE {
?item p:P1585 ?statement0.
?statement0 (ps:P1585 ) _:anyValueP1585 .
}
}
OPTIONAL { ?item rdfs:label ?item__rem__i_ara__is_arab filter (lang(?item__rem__i_ara__is_arab) = "ar"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_hye__is_armn filter (lang(?item__rem__i_hye__is_armn) = "hy"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_ben__is_beng filter (lang(?item__rem__i_ben__is_beng) = "bn"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_rus__is_cyrl filter (lang(?item__rem__i_rus__is_cyrl) = "ru"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_hin__is_deva filter (lang(?item__rem__i_hin__is_deva) = "hi"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_amh__is_ethi filter (lang(?item__rem__i_amh__is_ethi) = "am"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_kat__is_geor filter (lang(?item__rem__i_kat__is_geor) = "ka"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_grc__is_grek filter (lang(?item__rem__i_grc__is_grek) = "grc"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_guj__is_gujr filter (lang(?item__rem__i_guj__is_gujr) = "gu"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_pan__is_guru filter (lang(?item__rem__i_pan__is_guru) = "pa"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_kan__is_knda filter (lang(?item__rem__i_kan__is_knda) = "kn"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_kor__is_hang filter (lang(?item__rem__i_kor__is_hang) = "ko"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_lzh__is_hant filter (lang(?item__rem__i_lzh__is_hant) = "lzh"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_heb__is_hebr filter (lang(?item__rem__i_heb__is_hebr) = "he"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_khm__is_khmr filter (lang(?item__rem__i_khm__is_khmr) = "km"). }
bind(xsd:integer(strafter(str(?item), 'Q')) as ?id_numeric) .
}
ORDER BY ASC (?id_numeric)
Potential annoying issue
Queries with this many items vary a lot in runtime. Even with pagination, a query sometimes takes over 40 seconds (dropping to about 5 seconds when cached). So I think we will definitely need some rudimentary way, in the bash functions, to check whether a query timed out and adjust the delay before trying again.
Still to do
However, we may need to create more than one query, because this strategy (merging with datasets) would require that we already know upfront which Wikidata Q is linked to each IBGE code.
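The retry-on-timeout idea could look roughly like the sketch below. It is written in Python for illustration only (the thread plans to do this in bash), and it assumes a caller-supplied `fetch` callable that raises `TimeoutError` when the endpoint times out. The names `run_with_retry`, `fetch`, `attempts` and `first_delay` are all hypothetical.

```python
import time


def run_with_retry(fetch, query, attempts=4, first_delay=5.0, sleep=time.sleep):
    """Call fetch(query), retrying on TimeoutError with a doubling delay.

    `fetch` is an assumed callable that sends the query and returns the result;
    `sleep` is injectable so the function can be tested without waiting.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch(query)
        except TimeoutError:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(delay)
            delay *= 2  # wait longer before the next try


if __name__ == "__main__":
    calls = []

    def flaky_fetch(query):
        # Simulated endpoint: times out twice, then answers.
        calls.append(query)
        if len(calls) < 3:
            raise TimeoutError("simulated timeout")
        return "ok"

    print(run_with_retry(flaky_fetch, "SELECT ...", sleep=lambda s: None))
```

Doubling the delay is only one possible policy. Since cached results return much faster, even a fixed pause before retrying may be enough in practice.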
Current working example
Query building of only interlingual codes (can be used as key to merge translations)
SELECT (?wikidata_p_value AS ?item__conceptum__codicem) (STRAFTER(STR(?item), "entity/") AS ?item__rem__i_qcc__is_zxxx__ix_wikiq) WHERE {
{
SELECT DISTINCT ?item WHERE {
?item p:P1585 ?statement0.
?statement0 (ps:P1585 ) _:anyValueP1585 .
}
}
?item wdt:P1585 ?wikidata_p_value .
}
ORDER BY ASC (?wikidata_p_value)
Image of generated csv (proof of concept; 3 merged parts of 20; need to automate the rest and fix the order of columns)
temp directory
We have not even merged the remaining 17 of the 20 language pages and the file size is already 1.6 MB. No idea how big this will be with all languages.
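Merging the per-page CSV results on a shared key column could be sketched as follows. This is an assumption about how the merge might work, not the script used for the proof of concept. `merge_pages` is a hypothetical name, and the sample labels are made up.

```python
import csv
import io


def merge_pages(pages, key):
    """Merge CSV texts column-wise on `key`, keeping first-seen row and column order."""
    header = [key]
    rows = {}
    for text in pages:
        reader = csv.DictReader(io.StringIO(text))
        # Collect new columns in the order they first appear.
        for name in reader.fieldnames or []:
            if name not in header:
                header.append(name)
        # Rows sharing the same key are combined into one record.
        for row in reader:
            rows.setdefault(row[key], {}).update(row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows.values())
    return out.getvalue()


if __name__ == "__main__":
    # Made-up sample pages; real pages would come from the paginated queries.
    page_1 = "item__conceptum__codicem,item__rem__i_ara__is_arab\nQ1,a\nQ2,b\n"
    page_2 = "item__conceptum__codicem,item__rem__i_rus__is_cyrl\nQ2,d\nQ1,c\n"
    print(merge_pages([page_1, page_2], "item__conceptum__codicem"))
```

Because rows are matched by key rather than by position, the pages do not need to come back in the same order, which matters if the ORDER BY is ever dropped.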
…s --cum-interlinguis=P1,P2... (MVP of generation of query adding more attributes)
…s --cum-interlinguis=P1,P2... (bugfix; duplicated related atributes now concatenate with |)
…ata_p_ex_linguis, (draft) wikidata_p_ex_totalibus
Performance issues
Hmm... now the issue is doing heavy optimization on the queries to mitigate the timeouts. The ones with over 5000 concepts are the issue here, even when splitting the 1603_1_51 languages into 20 parts. Maybe one strategy would be to allow removing the
… ordering on linguistic query to mitigate timeouts
bash wikidata_p_ex_totalibus()
The bash helper, while it still needs more testing, already deals with retrying to some extent. For something such as P1585 it now uses 1 + 20 queries.
However, later we obviously should get data from primary sources (in the case of IBGE, I think https://servicodados.ibge.gov.br/api/docs/localidades does it) and use them as the primary reference, potentially validating information from Wikidata.
Generalizations
Turns out that we can start bootstrapping other tables (the ones already perfect on Wikidata) the same way we did with IBGE municipalities, including translations in several languages! However, the same ideal approaches (such as relying on primary sources, then incrementing with Wikidata) would somewhat apply too. Sometimes this may not be really relevant. For example, something not strictly a place (like the
Also, eventually we will need to think somewhat in terms of an ontology, otherwise the #41 would be as efficient for general users.
Print screen
…ickly boostrap tables if reference P already not fully numeric (like brazilian URN Lex and CNPJ)
Already implemented and used in practice. Closing for now.
One item from #39, the
P1585 https://www.wikidata.org/wiki/Property:P1585 //Dicionários de bases de dados espaciais do Brasil//@por-Latn
actually is very well documented on Wikidata, so we would not need to fetch Wikidata Q one by one. It is rare for something to be so perfect, but the idea here would be to create an additional option on ./999999999/0/1603_3_12.py to create the SPARQL query for us.
This obviously will need pagination. If we already time out with ~300 Wikidata Q and over 250 languages on 1603_1_51 (for now using 5 batches), then with something like 5700 items, well, this will be fun.
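One way to picture the pagination is to split the language list into a fixed number of divisions and build one label query per page. The sketch below is only an assumption about how such a generator could work. `paginate` and `build_label_query` are hypothetical names, and treating the page number as 1-based is a guess.

```python
import math


def paginate(items, divisions, page):
    """Return the `page`-th (1-based) of `divisions` roughly equal chunks.

    Trailing pages may be shorter or empty when the items do not divide evenly.
    """
    if divisions < 1 or not 1 <= page <= divisions:
        raise ValueError("page must be within 1..divisions")
    size = math.ceil(len(items) / divisions)
    start = (page - 1) * size
    return items[start:start + size]


def build_label_query(prop, languages):
    """Build a label query; `languages` is a list of (variable_suffix, language_code)."""
    variables = [f"?item__rem__{suffix}" for suffix, _ in languages]
    lines = [
        'SELECT (STRAFTER(STR(?item), "entity/") AS ?item__conceptum__codicem) '
        + " ".join(variables) + " WHERE {",
        "  { SELECT DISTINCT ?item WHERE {",
        f"    ?item p:{prop} ?statement0 .",
        f"    ?statement0 ps:{prop} _:anyValue{prop} .",
        "  } }",
    ]
    # One OPTIONAL per language, so items without that label are still returned.
    for (_, code), var in zip(languages, variables):
        lines.append(
            f'  OPTIONAL {{ ?item rdfs:label {var} filter (lang({var}) = "{code}"). }}'
        )
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    languages = [
        ("i_ara__is_arab", "ar"),
        ("i_hye__is_armn", "hy"),
        ("i_ben__is_beng", "bn"),
        ("i_rus__is_cyrl", "ru"),
    ]
    print(build_label_query("P1585", paginate(languages, 2, 1)))
```

Keeping the item subquery identical on every page means each page returns the same set of items, so the pages can later be joined on the item code.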