-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Configurable HTML Parser Wrappers for BeautifulSoup and Resiliparse #47
base: main
Are you sure you want to change the base?
Conversation
We can enhance this further by adding support for third-party HTML parser wrappers. The idea is to allow users to pass the path to a custom wrapper module as a runtime argument. This approach would enable developers to use community-contributed libraries without requiring modifications to the existing codebase. Proposed Implementation
It would be worth considering this when there is an interest in third-party wrappers |
I like the idea of a wrapper that makes it easy to switch parsers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @silentninja, thanks for your contribution! I'll try it out soon.
@@ -18,3 +18,8 @@ lxml | |||
#fastwarc | |||
# (tested with) | |||
#fastwarc==0.14.1 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you verified that it it possible to install and use FastWARC and Resiliparse with different versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sebastian-nagel, thanks for the catch! Resiliparse has a strict version dependency on fastwarc and will throw up an error when installing incompatible versions. I will fix the tested version and add a comment in requirement.txt to highlight this
sparkcc.py
Outdated
@@ -71,7 +71,8 @@ def parse_arguments(self): | |||
|
|||
arg_parser.add_argument("input", help=self.input_descr) | |||
arg_parser.add_argument("output", help=self.output_descr) | |||
|
|||
arg_parser.add_argument("--html_parser", default="beautifulsoup", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's only one class which uses a HTML parser: CCIndexWordCountJob. Shouldn't the option be moved to the class were it is used?
If it's a global option, there may be confusion, such as "I thought WET files contain plain text from parsed HTML. Why I should specify a HTML parser?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's a global option, there may be confusion
Yes that's correct. I will move it to that particular job.
Thanks for the review, @sebastian-nagel! I’ve made the changes. Could you please review them again? |
Fixes #43
This PR introduces wrapper classes that provide a uniform API for common operations across two HTML parsers:
BeautifulSoup
andResiliparse
. The goal is to simplify switching between the parsers and ensure consistency in usage patterns.Additionally, the PR adds support for configuring the HTML parser wrapper at runtime by introducing a new
--html_parser
option when submitting a Spark job. This allows users to specify their preferred parser dynamically during execution.