Decoding of HTML entities in links #383
generic_getLinks() doesn't decode HTML entities. Besides, it doesn't parse the HTML and may therefore extract false links.
Current version of normalize_link() discards the query and/or fragment components of a URL.
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #383      +/-   ##
==========================================
+ Coverage   27.49%   28.06%   +0.57%
==========================================
  Files          26       26
  Lines        2550     2576      +26
  Branches     1356     1371      +15
==========================================
+ Hits          701      723      +22
- Misses       1368     1369       +1
- Partials      481      484       +3
Code LGTM. But I have a doubt about limiting to the four |
@veloman-yunkan Any reason why ' has not been implemented as described in the ticket? The probability of something like href='http://www.kiwix.org/j'aime' seems quite high to me.
@veloman-yunkan LGTM (thx for your quick fix)
@kelson42 My fault. Fixed now.
@kelson42 Before merging, it always makes sense to have fixup commits (if any) squashed into their respective commits (via a rebase).
My bad, sorry
Fixes #378
URI-decoding of links extracted from the HTML was already performed, so I only had to add handling of HTML entities. I did so only for the syntactically important characters &, <, > and ".
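To illustrate the kind of decoding described above, here is a minimal sketch, not the code from this PR. It assumes the four characters correspond to the named entities `&amp;`, `&lt;`, `&gt;` and `&quot;`; the function name `decodeBasicHtmlEntities` is hypothetical. Anything else that looks like an entity is left untouched.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper (an assumption for illustration, not the PR's code).
// Decodes only &amp; &lt; &gt; &quot; in a single left-to-right pass.
// Unknown or malformed entities are copied through unchanged.
std::string decodeBasicHtmlEntities(const std::string& s)
{
    static const std::unordered_map<std::string, char> entities = {
        {"amp", '&'}, {"lt", '<'}, {"gt", '>'}, {"quot", '"'}
    };

    std::string out;
    out.reserve(s.size());
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] == '&') {
            // Look for the terminating ';' and check the name in between.
            const std::size_t end = s.find(';', i + 1);
            if (end != std::string::npos) {
                const auto it = entities.find(s.substr(i + 1, end - i - 1));
                if (it != entities.end()) {
                    out += it->second;
                    i = end + 1;  // skip past the whole entity
                    continue;
                }
            }
        }
        out += s[i++];  // ordinary character, or '&' that starts no known entity
    }
    return out;
}
```

Because the input is consumed in one pass, already-decoded output is never rescanned, so `&amp;lt;` becomes `&lt;` rather than `<`. In a pipeline like the one described, such a step would presumably run on the raw attribute value before URI-decoding, but that ordering is an assumption here.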