Add rel="canonical" for search engines #65

lidel · 2019-09-27T07:38:18Z

Context: #48 (comment)

While openzim/mwoffliner#963 solves the problem for new snapshots, the script may still be run against an old ZIM without the canonical link tag.

There are two ways to solve this:

(A) HTML tag

Before adding to IPFS, the script should check whether the root document contains the canonical link tag and, if not, manually add it to every document.

Note: This is a fairly expensive check/step that has to happen during or after unpacking the ZIM archive, and it may mutate the original files. It may not be feasible for the 650GB archive of English Wikipedia, which already takes a long time to build.
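A minimal sketch of what the per-document check in (A) could look like, assuming the pages are plain HTML files with a closing `</head>` tag. The helper names `has_canonical` and `ensure_canonical` are hypothetical and not part of the mirror scripts; the regex check is a heuristic, not a full HTML parse.

```python
import re
from html import escape

# Heuristic match for an existing <link ... rel="canonical" ...> tag,
# regardless of attribute order or quoting style.
_CANONICAL_RE = re.compile(
    r'<link\b[^>]*\brel=["\']?canonical["\']?[^>]*>', re.IGNORECASE
)
_HEAD_CLOSE_RE = re.compile(r'</head\s*>', re.IGNORECASE)


def has_canonical(html: str) -> bool:
    """Return True if the document already declares a canonical link."""
    return _CANONICAL_RE.search(html) is not None


def ensure_canonical(html: str, canonical_url: str) -> str:
    """Insert a canonical link tag before </head> if none is present."""
    if has_canonical(html):
        return html
    match = _HEAD_CLOSE_RE.search(html)
    if match is None:
        # No </head> to patch; leave the document untouched.
        return html
    tag = f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'
    return html[:match.start()] + tag + html[match.start():]
```

Checking only the root document first, as suggested above, would let the script skip the rewrite of every other file when the snapshot already carries the tag.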

(B) HTTP header

Instead of mutating every page, we could address the problem at the Nginx level. Search engines are crawling *.wikipedia-on-ipfs.org, so we could simply add a Link HTTP header with rel="canonical":

$ curl -I https://en.wikipedia-on-ipfs.org/wiki/Vincent_van_Gogh.html
HTTP/2 200
Content-Type: text/html; charset=utf-8
Content-Length: 457709
Link: <https://en.wikipedia.org/wiki/Vincent_van_Gogh>; rel="canonical"

Given that the HTML tag will be added to newly created ZIM snapshots once openzim/mwoffliner#963 lands, the HTTP header sounds like a good temporary measure.

The task here would be to come up with a dynamic Nginx rule that adds a Link header whose URL matches the current one, but without .html and -on-ipfs.
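The URL mapping such a rule needs to perform can be sketched in Python as follows. This is only an illustration of the transformation (drop `-on-ipfs` from the host and `.html` from the path), not an Nginx configuration; the function name `canonical_link_header` and the constant `MIRROR_SUFFIX` are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Assumption: every mirror host has the form <lang>.wikipedia-on-ipfs.org
MIRROR_SUFFIX = ".wikipedia-on-ipfs.org"


def canonical_link_header(mirror_url: str) -> Optional[str]:
    """Build the Link header value for a mirror URL, or None if not a mirror."""
    parts = urlsplit(mirror_url)
    host = parts.hostname or ""
    if not host.endswith(MIRROR_SUFFIX):
        return None
    lang = host[: -len(MIRROR_SUFFIX)]
    path = parts.path
    if path.endswith(".html"):
        # Unpacked articles carry a .html suffix that wikipedia.org does not use.
        path = path[: -len(".html")]
    canonical = urlunsplit(("https", f"{lang}.wikipedia.org", path, "", ""))
    return f'<{canonical}>; rel="canonical"'


print(canonical_link_header(
    "https://en.wikipedia-on-ipfs.org/wiki/Vincent_van_Gogh.html"
))
# prints: <https://en.wikipedia.org/wiki/Vincent_van_Gogh>; rel="canonical"
```

An Nginx rule would express the same mapping with regex captures on the host name and request path.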


lidel · 2019-10-28T10:52:31Z

I've updated the description of this task and added (B) HTTP header, which may be a more feasible way to address this in the short term.

kelson42 · 2019-11-07T17:55:54Z

@lidel We have regenerated the WPTR ZIM file, so such old files will soon no longer be provided by us.

lidel · 2021-02-15T15:41:37Z

I confirmed the header is present in the latest snapshot handled since #77:

  <link rel="canonical" href="https://tr.wikipedia.org/wiki/Anasayfa">

Let's continue in #60 #61

Without this, search engines crawling the unpacked version may produce duplicate results. It does not cost much to have, and it avoids issues like the ones linked below. Ref. openzim/mwoffliner#963 ipfs/distributed-wikipedia-mirror#65 https://en.wikipedia.org/wiki/Canonical_link_element

lidel added the help wanted and snapshots (issues related to snapshot creation and updates) labels Sep 27, 2019

This was referenced Sep 27, 2019

Block internet search engines from indexing the mirror #48

Closed

Update en.wikipedia-on-ipfs.org #61

Closed

Update tr.wikipedia-on-ipfs.org #60

Closed

momack2 mentioned this issue Sep 27, 2019

Add all the other wikipedia snapshots #63

Open

17 tasks

lidel added the P0 Critical: Tackled by core team ASAP label Oct 27, 2019

lidel changed the title ~~Add <link rel="canonical" if missing~~ Add rel="canonical" for search engines Oct 28, 2019

derhuerst mentioned this issue Oct 28, 2019

serve with canonical link to "normal" Wikipedia derhuerst/wikipedia-feed-ui#1

Open

lidel mentioned this issue Dec 22, 2019

fix: add canonical link ipfs/ipfs-docs#50

Merged

lidel closed this as completed Feb 15, 2021

lidel mentioned this issue Feb 27, 2021

fix(zimdump): add canonical links to redirects openzim/zim-tools#228

Closed
