Add rel="canonical" for search engines #65

Closed
lidel opened this issue Sep 27, 2019 · 3 comments
Labels
help wanted P0 Critical: Tackled by core team ASAP snapshots issues related to snapshot creation and updates

Comments

@lidel
Member

lidel commented Sep 27, 2019

Context: #48 (comment)

While openzim/mwoffliner#963 solves the problem for new snapshots, the script may still be run against an old ZIM without the header.

There are two ways to solve this:

(A) HTML tag

Before adding to IPFS, the script should check whether the root document contains the header and, if not, manually add it to every document.

Note: This is a fairly expensive check/step that needs to happen during or after unpacking of the ZIM archive, and it may mutate the original files. It may not be feasible for the 650GB archive of English Wikipedia, which already takes a long time to build.
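For illustration, the check-and-add step in (A) could look roughly like the sketch below. This is not the mirror script's actual code: the helper names `has_canonical` and `ensure_canonical` are hypothetical, and it assumes each unpacked page is available as a string and that a regex match is good enough to detect an existing tag.

```python
import re
from html import escape

# Matches an existing <link ... rel="canonical" ...> tag (quotes optional).
CANONICAL_RE = re.compile(
    r'<link\b[^>]*\brel=["\']?canonical["\']?[^>]*>', re.IGNORECASE
)
HEAD_CLOSE_RE = re.compile(r'</head\s*>', re.IGNORECASE)


def has_canonical(html: str) -> bool:
    """Return True if the page already declares a canonical link."""
    return CANONICAL_RE.search(html) is not None


def ensure_canonical(html: str, canonical_url: str) -> str:
    """Insert a canonical link before </head> if the page lacks one."""
    if has_canonical(html):
        return html
    head_close = HEAD_CLOSE_RE.search(html)
    if head_close is None:
        # No <head> section found: leave the page untouched.
        return html
    tag = f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'
    i = head_close.start()
    return html[:i] + tag + html[i:]
```

Running this over every page is exactly the expensive mutation the note above warns about, so checking only the root document first (via `has_canonical`) would let the script skip the rewrite for snapshots that already carry the tag.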

(B) HTTP header

Instead of mutating every page, we could address the problem at the Nginx level. Search engines crawl *.wikipedia-on-ipfs.org, so we could simply add the rel=canonical HTTP header:

$ curl -I https://en.wikipedia-on-ipfs.org/wiki/Vincent_van_Gogh.html
HTTP/2 200
Content-Type: text/html; charset=utf-8
Content-Length: 457709
Link: <https://en.wikipedia.org/wiki/Vincent_van_Gogh>; rel="canonical"

Given that the HTML tag will be added to newly created ZIM snapshots when openzim/mwoffliner#963 lands, the HTTP header sounds like a good temporary measure.

The task here would be to come up with a dynamic Nginx rule that adds a Link header whose URL matches the current one, but without .html and -on-ipfs.
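The URL mapping such a rule needs can be sketched in Python as follows. This is not an Nginx configuration, only the rewrite logic that one would have to reproduce; the function names `canonical_for` and `link_header` are hypothetical, and dropping the query string and fragment is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_for(mirror_url: str) -> str:
    """Map a mirror URL to its upstream URL: no '-on-ipfs', no '.html'."""
    parts = urlsplit(mirror_url)
    # en.wikipedia-on-ipfs.org -> en.wikipedia.org
    host = parts.netloc.replace("-on-ipfs", "", 1)
    path = parts.path
    if path.endswith(".html"):
        path = path[: -len(".html")]
    # Assumption: query string and fragment are dropped.
    return urlunsplit((parts.scheme, host, path, "", ""))


def link_header(mirror_url: str) -> str:
    """Build the value of the Link response header for a mirror URL."""
    return f'<{canonical_for(mirror_url)}>; rel="canonical"'


print(link_header("https://en.wikipedia-on-ipfs.org/wiki/Vincent_van_Gogh.html"))
```

For the example page above, this yields the same `Link` value as shown in the curl output.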

@lidel lidel added help wanted snapshots issues related to snapshot creation and updates labels Sep 27, 2019
@lidel lidel added the P0 Critical: Tackled by core team ASAP label Oct 27, 2019
@lidel lidel changed the title from "Add <link rel="canonical" if missing" to "Add rel="canonical" for search engines" Oct 28, 2019
@lidel
Member Author

lidel commented Oct 28, 2019

I've updated the description of this task and added (B) HTTP header, which may be a more feasible way to address this in the short term.

@kelson42

kelson42 commented Nov 7, 2019

@lidel We have regenerated the WPTR ZIM file, so such old files will soon no longer be provided by us.

@lidel
Member Author

lidel commented Feb 15, 2021

I confirmed the header is present in the latest snapshot handled since #77:

  <link rel="canonical" href="https://tr.wikipedia.org/wiki/Anasayfa">

Let's continue in #60 #61

@lidel lidel closed this as completed Feb 15, 2021
lidel added a commit to lidel/zim-tools that referenced this issue Feb 27, 2021
Without this, search engines crawling the unpacked version may produce duplicated results.
It does not cost much to have it, and it avoids issues like the ones linked below.

Ref.
openzim/mwoffliner#963
ipfs/distributed-wikipedia-mirror#65
https://en.wikipedia.org/wiki/Canonical_link_element