Add rel="canonical" for search engines #65
Labels
help wanted
P0
Critical: Tackled by core team ASAP
snapshots
issues related to snapshot creation and updates
Comments
lidel
added
help wanted
snapshots
issues related to snapshot creation and updates
labels
Sep 27, 2019
This was referenced Sep 27, 2019
lidel
changed the title
Add <link rel="canonical" if missing
Add rel="canonical" for search engines
Oct 28, 2019
I've updated description of this task and added |
@lidel We have regenerated the WPTR ZIM file, so such old files will soon no longer be provided by us.
lidel
added a commit
to lidel/zim-tools
that referenced
this issue
Feb 27, 2021
Without this, search engines crawling the unpacked version may produce duplicate results. It costs little to add and avoids issues like the ones linked below. Ref. openzim/mwoffliner#963 ipfs/distributed-wikipedia-mirror#65 https://en.wikipedia.org/wiki/Canonical_link_element
While openzim/mwoffliner#963 solves the problem for new snapshots, the script may still be run against an old ZIM that lacks the canonical link.
There are two ways to solve this:
(A) HTML tag
Before adding to IPFS, the script should check whether the root document contains the canonical link tag and, if it does not, add the tag to every document.
Note: This is a fairly expensive check/step that has to happen during or after unpacking of the ZIM archive, and it may mutate the original files. It may not be feasible for the 650GB archive of English Wikipedia, which already takes a long time to build.
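A minimal sketch of the check-and-add step from option (A), assuming each unpacked page is read into a string and that the canonical URL for the page is already known. The names `has_canonical` and `ensure_canonical` are hypothetical and are not part of the existing mirror scripts.

```python
import re
from html import escape
from html.parser import HTMLParser


class _CanonicalFinder(HTMLParser):
    """Records whether a <link rel="canonical"> tag was seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            rel = dict(attrs).get("rel") or ""
            # rel is a space-separated token list, so compare tokens
            if "canonical" in rel.lower().split():
                self.found = True


def has_canonical(html_text: str) -> bool:
    """Return True if the document already declares a canonical link."""
    finder = _CanonicalFinder()
    finder.feed(html_text)
    return finder.found


def ensure_canonical(html_text: str, canonical_url: str) -> str:
    """Insert a canonical link before </head> if none is present.

    Documents without a closing </head> are returned unchanged
    (an assumption of this sketch, to avoid guessing at the markup).
    """
    if has_canonical(html_text):
        return html_text
    match = re.search(r"</head\s*>", html_text, re.IGNORECASE)
    if match is None:
        return html_text
    tag = f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'
    return html_text[:match.start()] + tag + html_text[match.start():]
```

Running `has_canonical` on the root document first, as the issue suggests, would let the script skip the per-page rewrite entirely for snapshots that already carry the tag.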
(B) HTTP header
Instead of mutating every page, we could address the problem at the Nginx level. Search engines are crawling *.wikipedia-on-ipfs.org, so we could simply add the rel=canonical HTTP header.
Given that the HTML tag will be added to newly created ZIM snapshots when openzim/mwoffliner#963 lands, the HTTP header sounds like a good temporary measure.
The task here would be to come up with a dynamic Nginx rule that adds a Link header with a URL matching the current one, but without .html and -on-ipfs.
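The URL mapping that such a rule needs can be sketched in Python before translating it into Nginx configuration. This is only an illustration of the transformation described above (drop `-on-ipfs` from the host, drop a trailing `.html` from the path); the function names and the example URL are assumptions, and the header value follows the standard `Link: <url>; rel="canonical"` form.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(mirror_url: str) -> str:
    """Map a mirror URL to the original URL it was snapshotted from."""
    parts = urlsplit(mirror_url)
    # en.wikipedia-on-ipfs.org -> en.wikipedia.org
    host = parts.netloc.replace("-on-ipfs", "", 1)
    # /wiki/Foo.html -> /wiki/Foo
    path = parts.path
    if path.endswith(".html"):
        path = path[: -len(".html")]
    # Query string and fragment are dropped in this sketch
    return urlunsplit((parts.scheme, host, path, "", ""))


def link_header_value(mirror_url: str) -> str:
    """Build the value of the Link response header for a mirror URL."""
    return f'<{canonical_url(mirror_url)}>; rel="canonical"'


print(link_header_value("https://en.wikipedia-on-ipfs.org/wiki/Foo.html"))
```

Whatever form the Nginx rule takes, it should produce the same value for each request as `link_header_value` does here, which makes this function usable as a reference when testing the rule.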