[Feature Request] Filtering through Python wrapper #222
Comments
Hi Jiri,
It's a joy seeing KNMI here :) Just for info, I'm using the CLI with -LF
option and forwarding the filtered output to a message queue (mqtt,
rabbitmq) so it can be offloaded and distributed to several hosts that can
process in realtime. It scales really well. I haven't explored the python
wrapper.
Kind regards,
…--
diego dot torres at gmail dot com - Madrid / Spain
Thanks for sharing, Diego! I've been in contact with @dsalantic as well, and he suggested doing the file reading/seeking in Python and then passing only the relevant bytes to the Asterix parser. I'm still working on implementing this, but preliminary results look good so far. I will update this thread when we've finalized our solution.
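A minimal sketch of that idea, assuming the input is a plain concatenation of ASTERIX data blocks with no extra framing around them. Each block starts with a 3-byte header: a 1-byte category followed by a 2-byte big-endian length that covers the whole block, header included. The function name `iter_blocks` and the sample bytes are made up for illustration and are not part of the asterix library.

```python
import io
import struct
from typing import BinaryIO, Iterator


def iter_blocks(stream: BinaryIO, allowed_categories: set[int]) -> Iterator[bytes]:
    """Yield the raw data blocks whose category is in allowed_categories."""
    while True:
        header = stream.read(3)
        if len(header) < 3:
            return  # end of stream, or a truncated trailing header
        # Header layout: 1-byte category, 2-byte big-endian total block length.
        category, length = struct.unpack(">BH", header)
        if length < 3:
            raise ValueError(f"invalid block length {length}")
        if category in allowed_categories:
            yield header + stream.read(length - 3)
        else:
            # Skip the payload without reading it into memory.
            stream.seek(length - 3, io.SEEK_CUR)


# Synthetic input (hypothetical payload bytes): one category 48 block
# followed by one category 62 block.
sample = bytes([48, 0, 5, 0xAA, 0xBB]) + bytes([62, 0, 4, 0xCC])
blocks = list(iter_blocks(io.BytesIO(sample), {62}))
```

Each yielded block could then be handed to `asterix.parse()`, so the decoder only sees the categories of interest.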
Hi @JiriBakker,
However, good Asterix processing performance can be achieved in Python too. Could you please share a short sample file (input), together with the expected result (output)? I would be interested in comparing the pure Python implementation with your existing processing pipeline or with the optimized implementation that you are working on.
Zoran
@zoranbosnjak Thanks for the input! For now we've implemented the optimization of filtering per category. Below is a sample of how we implemented this:

```python
from pathlib import Path
from typing import Any, Generator

import asterix


def generate_data_item_stream(path: Path, allowed_categories: list[int]) -> Generator[dict[str, Any], None, None]:
    with open(path, "rb") as file:
        # Every data block starts with a 3-byte header: category (1 byte) and total block length (2 bytes).
        while first_three_bytes := file.read(3):
            if len(first_three_bytes) < 3:
                # Truncated header at the end of the file.
                break

            category: int = first_three_bytes[0]
            length: int = first_three_bytes[1] * 256 + first_three_bytes[2]

            if category not in allowed_categories:
                # Skip the rest of this data block without reading it.
                file.seek(length - 3, 1)
                continue

            data_block: bytes = first_three_bytes + file.read(length - 3)

            # Because of backwards compatibility with older Asterix formats (v2.1 and earlier) a single data block
            # can contain one or multiple data items. This is the reason why `asterix.parse()` will return a
            # list instead of a single item. Note that within a data block all data items are guaranteed to have
            # the same category, so we can do the category filtering on data block level.
            data_items: list[dict[str, Any]] = asterix.parse(data=data_block, verbose=False)

            for data_item in data_items:
                yield data_item
```

So far this works fairly well for us. Files that contain multiple categories, some of which we are not interested in, are processed faster than before. We'll monitor the performance once we start processing larger amounts of data. If we have any additional findings we'll be sure to share them here.

@dsalantic I'll leave it up to you whether or not you want to close this issue. For now, the above solution is sufficient for us. Thanks again for the assistance!
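To gauge how much a category filter can save on a given file, one option is to scan only the block headers and total the bytes per category. The following is a rough sketch under the same assumption as the code above (the file is a plain sequence of data blocks, each with a 3-byte header); `bytes_per_category` is a hypothetical helper, not part of the asterix library.

```python
import io
import struct
from collections import Counter
from typing import BinaryIO


def bytes_per_category(stream: BinaryIO) -> Counter:
    """Return the total number of bytes (headers included) per category."""
    totals: Counter = Counter()
    while True:
        header = stream.read(3)
        if len(header) < 3:
            break  # end of stream, or a truncated trailing header
        # Header layout: 1-byte category, 2-byte big-endian total block length.
        category, length = struct.unpack(">BH", header)
        if length < 3:
            raise ValueError(f"invalid block length {length}")
        totals[category] += length
        # Only the header is needed, so skip the payload.
        stream.seek(length - 3, io.SEEK_CUR)
    return totals


# Synthetic input (hypothetical payload bytes): two category 48 blocks
# and one category 62 block.
sample = bytes([48, 0, 5, 0xAA, 0xBB]) + bytes([62, 0, 4, 0xCC]) + bytes([48, 0, 3])
totals = bytes_per_category(io.BytesIO(sample))
```

The share of bytes that belongs to unwanted categories gives a first estimate of how much decoding work the filter avoids. Actual speedup would still need to be measured on real data.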
Hi,
First of all, thanks so much for making this tool publicly available. At the Royal Netherlands Meteorological Institute (KNMI) we are making good use of it, so we're very happy that we're able to do so.
Currently we're trying to improve the performance of our overall pipeline that is using the Asterix decoding through the Python wrapper. One of the options we would like to explore is to see if we can reduce the processing time of the Asterix file by filtering out data items of categories that we are not interested in.
It seems like the CLI already has this option (-LF) but since we're using the Python wrapper this functionality doesn't seem available to use yet. Are we overlooking a certain feature that might help us filter the categories, or is this feature yet to be added? Even though we're not that well versed in C++, we might attempt adding the functionality ourselves (and submitting a pull request), but we would appreciate some advice on whether you think it's feasible and the amount of work it would likely require.
Thanks in advance!