Use simdjson to read WAT payloads #41
Comments
Do we want to exclusively support pysimdjson, or should we consider implementing adapter classes to support multiple parsers? This would allow users to switch parsers at runtime, similar to what is proposed for HTML parsers in PR #47.
As Sebastian alludes to by mentioning #34, we like having a fallback. I see that pysimdjson claims that it has a fallback internally, but you can still run into weirdnesses like a package lacking a wheel causing problems in CI.
@wumpus Thanks for the comment! I wanted to add a key point that These APIs, such as Summary of Options
This way, we balance compatibility and performance while letting users decide what works best for their needs. I prefer Option 3, as it is a much cleaner approach when optimizing for performance. Thoughts?
Hi @silentninja,
Yes. It looks like simdjson does not support every combination in the (OS, platform) matrix, see https://pysimdjson.tkte.ch/: Mac OS on ARM is not supported. This already happened in the past with ujson (#34). A working fallback is always required.
No. It adds extra complexity on every WAT record and does not deliver the maximum performance, see the notes about re-using the simdjson parser.
This can still be achieved through inheritance. For example, several example classes have a variant that uses FastWARC instead of warcio, see #37/#38. If you want to keep it simple or stay compatible, you can always use the warcio-based classes. Of course, with regard to simdjson, it only makes sense to implement a performant solution for classes that consume JSON (that is, WAT files) and are used as more than simple examples. So I could imagine implementing it in ExtractHostLinksFastWarcJob, because Common Crawl uses this class every month to extract the host-level links from which the web graph is built.
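To illustrate the inheritance idea, here is a minimal sketch. The class names `WatJsonJob` and `WatSimdjsonJob`, the `count_links` method, and the `make_job` helper are hypothetical and are not part of this repository. The sketch assumes pysimdjson's `Parser.parse(src, recursive)` call and the usual WAT nesting of `Envelope` / `Payload-Metadata` / `HTTP-Response-Metadata` / `HTML-Metadata` / `Links`.

```python
import json


class WatJsonJob:
    """Hypothetical base class: parses WAT payloads with the built-in json module."""

    def parse_payload(self, payload: bytes):
        return json.loads(payload)

    def count_links(self, payload: bytes) -> int:
        # Walk the assumed WAT nesting down to the list of extracted links.
        record = self.parse_payload(payload)
        links = (
            record.get("Envelope", {})
            .get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {})
            .get("Links", [])
        )
        return len(links)


class WatSimdjsonJob(WatJsonJob):
    """Hypothetical variant: same job logic, but one reused simdjson parser."""

    def __init__(self):
        import simdjson  # optional dependency; raises ImportError if missing

        # One parser per job instance, so its buffers are reused across records.
        self._parser = simdjson.Parser()

    def parse_payload(self, payload: bytes):
        # recursive=True (passed positionally) converts the document to plain
        # Python objects, so no proxy objects outlive the next parse() call.
        return self._parser.parse(payload, True)


def make_job() -> WatJsonJob:
    """Pick the simdjson variant when available, else the json-based class."""
    try:
        return WatSimdjsonJob()
    except ImportError:
        return WatJsonJob()
```

Only `parse_payload` differs between the two classes, so the json-based class stays available as the compatible default.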
Just noting that although it doesn't appear in the grid, it is supported and universal ARM/x86 wheels are published. Oversight on my part; the grid will be updated with the next release. If portability is your primary concern, https://github.com/tktech/py_yyjson performs much the same as pysimdjson while being standard C89, and has binary wheels available for all platforms with wheel tags. See https://github.com/tktech/json_benchmark for comparisons of the most popular parsers.
Simdjson (pysimdjson) should be faster than ujson when parsing WAT payloads. It could be worth using it as a drop-in replacement if installed (cf. #34 regarding ujson replacing the built-in json module).
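A rough sketch of what "drop-in replacement if installed" could look like follows. The names `parse_wat_payload` and `JSON_BACKEND` are made up for illustration, and the sketch assumes that `simdjson.loads` mirrors the signature of `json.loads`, as the pysimdjson documentation describes.

```python
import json

try:
    import simdjson  # pysimdjson, optional

    _loads = simdjson.loads
    JSON_BACKEND = "simdjson"
except ImportError:
    # Fallback: the built-in module is always available.
    _loads = json.loads
    JSON_BACKEND = "json"


def parse_wat_payload(payload):
    """Parse one WAT record payload (bytes or str) with the selected backend."""
    return _loads(payload)
```

This keeps a working fallback on platforms without a pysimdjson wheel, but it creates no reusable parser, so it would likely not reach the maximum speed simdjson can offer.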